The GigaOM Structure conference a couple of weeks ago addressed many areas of cloud computing. One of the key themes of the event was the emergence of new data architectures. Throughout the panels, interviews, and presentations, many speakers identified significant changes coming in how data gets handled.
Paul Maritz, CEO of VMware, argued that the traditional roles of the operating system are managing system resources and providing services, and that in both areas it is being replaced. He said that virtualization technologies are extending beyond CPU and memory to cover areas such as storage and policy. He also noted that programming frameworks, like the Spring framework or Ruby frameworks (presumably Rails), are becoming the dominant means of getting services. Paul mentioned that VMware is looking at buying other companies to offer frameworks outside of the Java arena. He argued these frameworks are important because cloud services are the new hardware - they are a black box you aren't allowed to see inside, so the framework layer allows portability and instrumentation for management. Beyond frameworks, they are investing heavily in queueing and data caching technologies - VMware sees the whole "data stack" as being in great ferment.
At the VC Panel, Ping Li of Accel Partners, who have invested in Cloudera, said that they see a lot of opportunities around the emergence of a new data stack using NoSQL as well as analytics and OLAP that companies like Cloudera provide. Assuming Hadoop is in place, there's a need for additional analytics - just as web 2.0 companies required a new data layer, new cloud applications will need the same.
The event included a launch pad for new start-ups. Those related to changing data processing were:
- Datameer - providing tools for analysts to do big data analytics using a spreadsheet metaphor, without coding
- NorthScale - launched the beta of their MemBase server, which extends Memcached with persistent storage
- Nephosity - providing wizards to configure Hadoop jobs instead of coding them
- Riptano - providing enterprise support and distributions for Apache Cassandra, a leading NoSQL key/value store
There was a panel on Scaling the Database in the Cloud with representation by 10gen, Neo Technology, Pervasive Software, Clustrix, Terracotta, and NorthScale. The one area of agreement among this group was that traditional databases don't work well in a scalable cloud environment. Paul Mikesell of Clustrix took the point of view that a distributed database can fix the problem of non-scalable implementations, and that databases provide a single point of management and can offer fungible resources. Others took the view that relational databases are harder for developers to use, and that alternative data formats and workloads are best served by different data storage mechanisms. Mike Hoskins of Pervasive Software noted that the death grip relational databases have had on all data problems is ending - that post-relational is a more important concept than NoSQL.
There was a lot of debate about whether SQL is useful or not: Roger Bodamer of 10gen argued that NoSQL is more natural for modeling domains, whereas relational schemas involve too many tables and relationships and require fantastic data modelers to make work. Emil Eifrem of Neo Technology argued that SQL is often the pain point - that developers hate SQL and use tools to avoid working with it. Likewise, Amit Pandey of Terracotta noted that developers normally use Hibernate (for Java and .NET) as their abstraction layer for programming databases. Roger Bodamer also argued that for analysis SQL is of great utility, but that NoSQL is useful for horizontal scalability of reads and writes where you don't need strong transactional coherence, and that there are a variety of database types suited for different purposes: graph databases, key/value stores, document stores, and column stores. He argued that MongoDB is a leading document-based database, and said that users are asking operational questions as they put the technology into production: e.g., how do you back up the data, and what are the best practices. Paul Mikesell agreed that there are different requirements between analysis databases and OLTP databases, with the latter requiring coherency and more concurrency, which necessitates a row-based format, unlike the column-based format of analysis databases.
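To make the document-versus-relational modeling point concrete, here is a minimal sketch - not anything the panelists showed. It assumes a local MongoDB instance and the pymongo driver, and the collection and field names are invented for illustration.

```python
# Sketch: an order stored as a single document rather than as rows in several tables.
# Assumes a MongoDB server on localhost and the pymongo driver; all names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Document model: the order, its line items, and the shipping address live together,
# so reading an order back is a single lookup with no joins.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        {"sku": "BOOK-42", "qty": 1, "price": 18.00},
        {"sku": "PEN-07", "qty": 3, "price": 2.50},
    ],
    "ship_to": {"city": "Melbourne", "country": "AU"},
})

print(orders.find_one({"order_id": 1001}))

# A relational design would typically spread this across orders, order_items, customers,
# and addresses tables, which is the modeling overhead Roger Bodamer was pointing at.
```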
On the question of how the cloud is changing databases, Paul Mikesell pointed to sharding as evidence of the failure of single-instance databases in the cloud, and in local data centers too, and noted that Hadoop was doing a great job on the analytics side, whereas Clustrix and other companies are focusing on transactions. Amit Pandey argued that the weight of 25 years of legacy is killing traditional database designs. Roger Bodamer added that scalable database implementations can tolerate failures like losing a rack, which makes them better suited to the cloud, as well as having lower overhead from a fresh design. Mike Hoskins noted that a key question is how to get data in and out of the database: given the rich tools relational databases have for loading data, reporting, and managing metadata, there will be a lot of gaps in non-traditional database implementations; in his view these engines mostly deal with analysis data, where such gaps are more manageable than they would be for transaction processing.
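To illustrate why sharding is seen as a symptom rather than a solution, here is a minimal sketch of application-level shard routing; the shard endpoints and the hashing scheme are assumptions made for the example, not anything Clustrix described.

```python
# Sketch of application-level sharding: the application, not the database,
# decides which instance holds a given customer's rows.
import hashlib

# Hypothetical shard endpoints; in practice each would be a separate database server.
SHARDS = [
    "postgresql://db-shard-0.internal/app",
    "postgresql://db-shard-1.internal/app",
    "postgresql://db-shard-2.internal/app",
]

def shard_for(customer_id: str) -> str:
    """Route a customer to a shard by hashing their id."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))

# Cross-shard queries, rebalancing, and schema changes now all have to be coordinated in
# application code; distributed databases aim to pull that work back into the database layer.
```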
There was also a panel on Big Data with representation by Cloudera, SQLStream, NEC Labs, Yahoo, and ParAccel. When asked how you know if you are dealing with big data, Amr Awadallah of Cloudera said you know when you are constantly buying new disks and archiving tapes. Damian Black of SQLStream said you know when you have indigestion and are unable to keep up with the pace of arriving data. Hakan Hacigumus of NEC Labs said you have "bigger data" when your existing data management is falling apart. Todd Papaioannou of Yahoo sees big data as more related to data composition - it is "gobs and gobs" of unstructured or semi-structured data, where you are looking to discover value. Barry Zane of ParAccel sees it as being when you have so much information that getting answers to your questions in a reasonable time frame is a daunting task, e.g., for clickstream analysis.
The panelists were asked for examples of real world use. Amr Awadallah said eBay have the third largest Hadoop cluster in existence, holding a few petabytes of data, and move data between it and a traditional data warehouse. The main value of the eBay Hadoop cluster is allowing people to run complex algorithms: new ways of matching products to people, computing rankings, and fraud detection. He also noted that the Apollo Group (the University of Phoenix parent company) have a very large Hadoop cluster and use it to analyze how students interact with online content in order to optimize content delivery. Damian Black said the Australian government is working on a project to monitor all vehicles on freeways and dynamically set speed limits, which needs to be done in real time to avoid the compression waves that cause traffic jams. Barry Zane noted that Fidelity National Information Services are correlating credit card activities to detect new methods of fraud. Todd Papaioannou noted that Yahoo analyzes 45 billion events a day to target content to user interests and to drive behavioral ad targeting, spam filtering, and machine learning.
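For readers who haven't seen the programming model behind these Hadoop workloads, here is a minimal sketch of the MapReduce pattern in Hadoop Streaming style, counting events per URL in a clickstream log. The log format, field positions, and script name are assumptions made for the example, not anything the panelists described.

```python
# Sketch of the MapReduce pattern in Hadoop Streaming style: the mapper emits
# (url, 1) pairs and the reducer sums the counts for each url. Hadoop sorts the
# mapper output by key before it reaches the reducer.
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'url<TAB>1' for each clickstream line of the form 'timestamp user url'."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            print(f"{fields[2]}\t1")

def reducer(lines):
    """Sum the counts for each url, assuming the input is sorted by url."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for url, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{url}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Invoked as 'python clicks.py map' or 'python clicks.py reduce', e.g. as the
    # -mapper and -reducer commands of a hypothetical Hadoop Streaming job.
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```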
When asked how much of the Fortune 1000 have big data problems they are addressing now and will be in three years, Hakan Hacigumus said about 30% of them need non-traditional technology to solve data problems. Amr Awadallah said that they all have the problems, they just don't all realize it yet: archiving data, for example, means moving it into a tape graveyard, never to be touched unless the government asks you to get it back. He also sees a sore need for data consolidation instead of having data scattered across 20-30 databases. Todd Papaioannou said they all have the problem but they haven't figured out how much they will pay to solve it. Barry Zane noted that their clients have a problem in mind but are often constrained by preconceived notions of what they can do: he gave the example of a retailer that came in to benchmark query performance, but found it could perform market basket analysis to discover affinity across all their products in minutes instead of days.
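Market basket analysis of the kind that retailer ran boils down to counting how often products appear together in the same transaction. A minimal pure-Python sketch, with made-up transactions, looks like this:

```python
# Sketch of market basket analysis: count how often pairs of products are bought
# together, which is the affinity measure the retailer was after.
from collections import Counter
from itertools import combinations

# Illustrative transactions; a real run would stream these from sales records.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Most frequently co-purchased pairs first.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

At retail scale the same pair counting is what gets pushed down into a parallel analytic database or a Hadoop job, which is why it can drop from days to minutes.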
Erich Clementi of IBM noted that healthcare, government, financial services, and retail organizations are interested in big data analytics. He noted that IBM's internal sales data used to be spread across 300 data marts and 40-50 applications; they eliminated the applications and consolidated all the marts into a single petabyte data mart, which over 100,000 people access daily. He also noted that they are working with healthcare providers to provide a HIPAA-compliant cloud, which can reduce the 8-12% of clinical trial costs that is spent moving data. While there can be competitive issues on some data sets, Erich Clementi saw significant opportunity in sharing data even among competitors, such as for drug discovery or allowing financial services firms to share data for fraud detection, even though they wouldn't share data to improve trading. Organizing multi-tenant clouds like this requires attention to security and privacy requirements, of course.
The theme of changing data architectures was noticeable throughout the conference, and from the perspectives of many different companies, including large scale users of data as well as vendors. There was interest in NoSQL engines for lightweight data storage as well as in large scale data analysis solutions, notably Hadoop, which applies MapReduce to data held in distributed file systems.