Big Data as a Service, an Interview with Google's William Vambenepe

Many of the big data technologies in common use originated at Google and have become popular open source platforms, but now Google is bringing an increasing range of big data services to market as part of its Google Cloud Platform. InfoQ caught up with Google's William Vambenepe, lead product manager for big data services, to ask him about the shift towards service-based consumption.

InfoQ: Hadoop, HDFS and HBase were inspired by Google's MapReduce, GFS and Bigtable. How much does the service now called Bigtable differ from the internal platform of a decade ago?

William: Bigtable has gone through several major iterations in its lifetime at Google, driven by the evolving requirements of supporting Google’s major applications. In many ways, the Bigtable that is part of the bedrock at Google today is significantly different from the technology that was originally developed in 2004. For example, after its internal implementation, a significant amount of work was done on improving the 99th percentile latency, which became an increasingly strong requirement once Google started serving traffic out of the database. This drove a lot of work into diagnosing and grinding away the tail latency.

Additionally, multi-tenancy within Google has been a significant challenge, and in offering the technology as an external service, a lot of work had to take place around isolating all the different layers of resources that are utilized. One last note is that this service is offered through the completely open source HBase API/client, which is somewhat ironic, given that Bigtable was the original technology and there is a tremendously powerful client infrastructure internally. However, we think it was the right thing to do, since the HBase community is diverse and powerful and we want to continue to work together with this amazing ecosystem.
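
To make "offered through the open source HBase API/client" concrete, here is a minimal Java sketch using the standard HBase client. The table name, column family and row key are hypothetical, and obtaining the Connection is left out of the sketch (Cloud Bigtable documents an HBase-compatible connector for that step):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableViaHBaseApi {

    // The same code can target HBase or Cloud Bigtable; only how the
    // Connection is created differs.
    static void writeAndRead(Connection connection) throws Exception {
        // Hypothetical table and column family, purely for illustration.
        try (Table table = connection.getTable(TableName.valueOf("user-events"))) {
            Put put = new Put(Bytes.toBytes("user#1234"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("last_login"),
                    Bytes.toBytes(System.currentTimeMillis()));
            table.put(put);

            Result row = table.get(new Get(Bytes.toBytes("user#1234")));
            long lastLogin = Bytes.toLong(
                    row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("last_login")));
            System.out.println("last_login = " + lastLogin);
        }
    }
}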

InfoQ: Over the last few years Google has talked about a bunch of internal data services such as Dremel, MegaStore and Spanner; are they now finding their way into services that anybody can use on demand?

William: Definitely, many of them are. Many of the services we use inside Google are extraordinarily useful outside of Google. A clear example of this is BigQuery, which exposes Dremel as a service, allowing users to analyze potentially enormous (petabyte-sized) datasets with SQL queries which typically execute in just a few seconds and require no cluster management from the user.
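
To illustrate the "no cluster management" point, here is a minimal sketch of issuing a query with the google-cloud-bigquery Java client. The project, dataset, table and column names are hypothetical, and the exact client API may vary between library versions:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQuerySketch {
    public static void main(String[] args) throws Exception {
        // Authenticates via Application Default Credentials; there is no
        // cluster to provision, size or tune.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical table and columns, purely for illustration.
        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
                "SELECT page, COUNT(*) AS views "
                + "FROM `my-project.logs.pageviews` "
                + "GROUP BY page ORDER BY views DESC LIMIT 10")
            .build();

        // The service plans and runs the query; the caller only reads rows back.
        TableResult result = bigquery.query(query);
        for (FieldValueList row : result.iterateAll()) {
            System.out.printf("%s: %d%n",
                    row.get("page").getStringValue(),
                    row.get("views").getLongValue());
        }
    }
}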

Another example would be Cloud Datastore, which currently relies on Megastore to provide a NoSQL transactional database that can handle Google-scale data sets.
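
A minimal sketch of what "NoSQL transactional" means in practice, using the google-cloud-datastore Java client; the Account kind, its balance property and the key names are hypothetical:

import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Transaction;

public class DatastoreTransactionSketch {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

        // Hypothetical entities, purely to show the transactional API.
        Key from = datastore.newKeyFactory().setKind("Account").newKey("alice");
        Key to = datastore.newKeyFactory().setKind("Account").newKey("bob");

        Transaction txn = datastore.newTransaction();
        try {
            Entity a = txn.get(from);
            Entity b = txn.get(to);
            long amount = 100;
            txn.put(Entity.newBuilder(a).set("balance", a.getLong("balance") - amount).build());
            txn.put(Entity.newBuilder(b).set("balance", b.getLong("balance") + amount).build());
            // Both writes commit atomically, or neither does.
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}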

Beyond those, Google Cloud exposes other internal tools which have been described in published papers. For example, Cloud Dataflow unifies two internal tools, FlumeJava (for batch processing) and Millwheel (for stream processing), to provide a unified programming model and managed execution environment for both batch and stream processing.

And yes, we're always looking at situations where other Google technology could be exposed as a service -- and Spanner is definitely something that's generated a lot of interest in this area.

InfoQ: In tech we very often talk about better, faster, cheaper - pick two. Is it possible that cloud-based big data services will offer all three (versus do-it-yourself approaches using open source or packaged products)?

William: Let’s see...

Better?

Just a few examples: BigQuery performance at scale is unparalleled (we have customer queries which process several petabytes in a single query). Dataflow unifies batch and stream in one programming model and offers the most advanced semantics in the industry for stream processing (e.g. windowing by actual event time, not arrival time). Bigtable vastly outperforms comparable databases on read and write latency. Etc. And all those capabilities come as fully managed services, not at the cost of weeks of deployment/configuration/tuning.
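
The event-time windowing point can be made concrete with a short sketch in the Dataflow programming model, shown here with the Apache Beam Java SDK, which grew out of the Dataflow SDK (package names differ between the two). The "click" events and the one-minute window size are hypothetical choices for illustration:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class EventTimeWindowingSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Three hypothetical "click" events with explicit *event* timestamps;
        // a real pipeline would read them from a stream with timestamps attached.
        Instant base = Instant.parse("2015-01-01T00:00:00Z");
        p.apply(Create.timestamped(
                TimestampedValue.of("click", base),
                TimestampedValue.of("click", base.plus(Duration.standardSeconds(30))),
                TimestampedValue.of("click", base.plus(Duration.standardMinutes(2)))))
         // Group by one-minute windows of when the events happened,
         // not when they arrive at the pipeline.
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
         .apply(Count.perElement());
        // With event-time windows, the first two clicks land in one window and
        // the third in another, regardless of processing delays.

        p.run();
    }
}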

Faster?

Clearly faster in terms of product performance (as mentioned above), but in this context I think "faster" refers to the speed at which an organization is able to move (which is the most important aspect in the end). In that sense, the fully-managed data services on Google Cloud allow organizations to get results immediately. Because there is no setup required, what would normally be an "IT project" starting with capacity planning, provisioning, deployment, configuration, etc., can fast-forward straight to the productive part.

Cheaper?

Google’s Big Data services allow users to pay only for what they consume, both in terms of storage and processing. So when you’re storing 16TB of data you pay for 16TB of data; you don’t have to provision (and pay for) extra storage to account for growth, you don’t have to double or triple this for redundancy (it happens under the covers), etc. Similarly for processing: you don’t need to pay to maintain idle resources when you’re not actively processing or querying your data. To this lower infrastructure cost you can add the savings of not having to deploy, manage, patch and generally administer data processing infrastructure.

So... yes. Pick three of three.

InfoQ: From a performance perspective, the launch of Bigtable concentrated on write throughput and tail latency. Why are these the key metrics that developers and their users should care about?

William: Fundamentally, wide-column stores (Bigtable, HBase, Cassandra) are scale-out databases which provide true linear scalability and are designed to be used both as an operational and analytical backend. Because their value proposition relies on making large amounts of data available very quickly, the fundamental metrics revolve around volume and speed.

One of our concerns was that the NoSQL industry in general is still tossing around benchmarks focusing on the 50th percentile of latency, both on the read and write sides. At Google we think this is a bad practice. Being focused on the 50th percentile of performance means that half of your requests (and by extension potentially half of your customers) are getting an unboundedly worse experience than the one you are testing for. Instead, Google focuses on the 99th or 99.9th percentile, meaning that we are characterizing the expected experience for the vast majority of our users. In this way we better understand what effects our architectural and configuration choices have.
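
A tiny worked example of why the choice of percentile matters, as a hedged sketch (the nearest-rank percentile method and the sample latencies are arbitrary choices for illustration):

import java.util.Arrays;

public class LatencyPercentilesSketch {

    // Nearest-rank percentile over a sorted copy of the samples (p in [0, 1]).
    static double percentile(double[] samplesMs, double p) {
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical request latencies in milliseconds.
        double[] latencies = {4, 5, 5, 6, 6, 7, 8, 9, 12, 480};
        System.out.printf("p50 = %.0f ms%n", percentile(latencies, 0.50)); // 6 ms
        System.out.printf("p99 = %.0f ms%n", percentile(latencies, 0.99)); // 480 ms
        // The median looks healthy while the tail exposes the 480 ms request
        // that a p50-only benchmark would hide.
    }
}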

With regards to throughput, the industries that we think can take tremendous advantage of Bigtable are ones where the ability to collect and store more data means better decisions at the application layer. More throughput on a smaller infrastructure makes applications easier to create and bring to market, and makes it easier to scale existing applications to drive better insights.

InfoQ: The disruption we've seen across the industry for the last few years seems to be (in economic terms) moving us from economic rents for (packaged software) suppliers to consumer surpluses for end users. Do you see the disrupted supplier gains accruing to service providers rather than say distribution support companies?

William: In addition to the savings from purchasing (and renewing) commercial software licenses, users will see even more savings from a much more flexible consumption model (paying just for what they need), a much smaller administration overhead (fully managed services) and a generally more efficient infrastructure (leveraging the expertise and economies of scale of huge providers like Google). While some of it will be concretized as a displacement of revenue from packaged software licenses to Cloud providers, I think most of the gains from the move to Cloud will be accrued by users.

But even more than the cost savings, the benefits of Cloud for users will be about getting a lot more from IT, not just paying less for it. The Cloud model will give wider access to information within the company, it will revolutionize collaboration, and it will provide access to advanced tools (e.g. advanced machine learning) which would be very hard to implement on-premise but very easy to consume as a service.

InfoQ: Companies like Google seem to have put a lot more effort into building data services than underlying infrastructure. Does this mean that this shift to public cloud puts a lot more on the table for adopters than might be perceived by focusing just on IaaS (and PaaS/SaaS)? Have the NIST definitions for cloud channelled the conversation too much and blinded us to other aspects of 'as a service' delivery?

William: I wouldn’t say that Google has put more effort in data services than in underlying infrastructure. We’ve put huge efforts into infrastructure, even though they are less visible. Some of the greatness of the more-visible data services comes from the greatness of the underlying infrastructure. For example, we recently described some of our networking innovation. As I pointed out on Twitter, this phenomenal performance of the underlying infrastructure plays a large role in allowing higher-level services like BigQuery or Dataflow to shine (in both performance and cost).

The NIST definitions are useful for an initial pass and served well in categorizing providers in the early days of Cloud. But in practice, Cloud services provide a continuum of options and the IaaS/PaaS breakdown is a bit too simplistic. People who seriously consider their needs and options for Cloud usage have shown that they understand the value of consuming the highest-level services which are flexible enough for their needs. For example, Cloud Dataflow provides the most optimized fully managed environment for data processing pipelines. If it meets your functional needs, it’s the best operational choice but if it doesn’t (e.g. you want to use a language not supported on Dataflow) then you always have the option to go to a lower-level service and use Spark, Flink, or Hadoop on GCE. Most large-scale customers use a mix of services, at various levels of management, to optimally meet their various needs. It’s the job of the Cloud Platform to ensure those services are well integrated and can be combined seamlessly.

InfoQ: How has the state of the art evolved for large-scale processing since MapReduce started the category over ten years ago?

William: Quite a lot!

MapReduce opened the gate to a world of cost-efficient large-scale computing. Pretty quickly though, our internal usage of MapReduce at Google showed that writing optimized sequences of MapReduce steps for real-life use cases was complex, so we moved towards developing a higher-level API, which can be automatically transformed into optimized MapReduce. That was the original FlumeJava work (not related to Apache Flume). Then we moved on to skipping MapReduce altogether and running the pipeline as a DAG (Directed Acyclic Graph).

The open source world followed more or less the same path with a delay. Hadoop brought MapReduce to the world, then tools like Apache Crunch and Cascading provided a FlumeJava-like pipeline API, and Spark and Tez brought a DAG-centric execution engine.

At the same time that these batch processing technologies were becoming more refined, the need emerged to process large data streams in near real-time. Google originally developed this model as Millwheel, a separate processing engine.

With Cloud Dataflow, we are taking the next step, merging batch and streaming processing into a unified programming model and isolating the definition of the processing from the choice of how to run it (which execution engine, and whether applied to historical data or to an ongoing stream). But this time, we’re doing it for everyone, as an open source SDK and as a publicly available service.
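
As a sketch of "isolating the definition of the processing from the choice of how to run it", the pipeline below could be handed to different execution engines simply by passing a different --runner option on the command line (for example a local runner for testing or the Dataflow service for production). The input and output paths are hypothetical, and the example again uses the Apache Beam Java SDK that descends from the Dataflow SDK:

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class UnifiedPipelineSketch {
    public static void main(String[] args) {
        // The runner (local, Dataflow, ...) comes from the command line,
        // not from the pipeline code.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("gs://my-bucket/input/*.txt"))      // hypothetical path
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("gs://my-bucket/output/counts"));    // hypothetical path

        p.run().waitUntilFinish();
    }
}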

About the Interviewee

William Vambenepe leads the Product Management team for Big Data services on Google Cloud Platform. Prior to Google, he was an Architect at Oracle and a Distinguished Technologist at HP.
