On August 12, Google announced that its big data processing service, Cloud Dataflow, has reached general availability. The managed service allows customers to build pipelines that transform data before it is processed by big data solutions. Cloud Dataflow supports both streaming and batch programming in a unified model.
The Cloud Dataflow service is the evolution of several internal Google projects, including MapReduce, Flume and MillWheel. Google first released an early access preview of Cloud Dataflow at Google I/O in June 2014. This release was followed by alpha and beta releases in December 2014 and April 2015 respectively.
A common challenge for organizations is processing large amounts of data before it is ingested into cloud or on-premises big data platforms. Data often needs to undergo ETL-like operations such as enrichment, shaping, filtering, consolidation, computation and composition. Google’s Cloud Dataflow platform has been designed to address these challenges for customers within the Google Cloud Platform. Organizations that prefer to run their big data workloads on-premises can build their own data pipelines using Google’s SDK or via third-party solutions.
Customers may also use Google Dataflow in high-volume computation scenarios where they need to process more data than a cluster’s memory footprint can handle. Using Cloud Dataflow, developers can break these jobs into perfectly parallel data processing tasks that execute concurrently and independently of each other, as sketched below.
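The SDK’s unit of per-element parallelism is the ParDo transform. The following is a minimal sketch using the Dataflow Java SDK; the bucket paths and the upper-casing step are hypothetical stand-ins for real per-element work:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class ParallelTasks {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))      // hypothetical input
     // ParDo invokes the DoFn once per element with no shared state,
     // so the service is free to fan the work out across many workers.
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().toUpperCase());                // stand-in per-element work
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output/part"));  // hypothetical output

    p.run();
  }
}
```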
Some of the benefits that Google claims include:
- A NoOps model that allocates resources on demand, with intelligent auto-scaling and automated work optimization.
- A unified, functional programming model that supports both batch and stream-based processing (see the sketch after this list).
- An extensible, open source SDK that enables custom scenarios for customers and allows for third-party integration.
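To illustrate the unified model, here is a sketch assuming the 1.x Java SDK: the counting logic is written once as a PTransform and reused unchanged for a bounded (batch) collection and for an unbounded (streaming) collection, which only needs a windowing strategy first. The class and method names beyond the SDK’s own are illustrative:

```java
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

/** Counting logic written once; it does not care whether its input is bounded. */
class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> apply(PCollection<String> words) {
    return words.apply(Count.<String>perElement());
  }
}

class UnifiedModel {
  /** Batch: apply the transform directly to a bounded collection. */
  static PCollection<KV<String, Long>> batchCounts(PCollection<String> words) {
    return words.apply(new CountWords());
  }

  /** Streaming: window the unbounded collection, then apply the same transform. */
  static PCollection<KV<String, Long>> streamingCounts(PCollection<String> words) {
    return words
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(new CountWords());
  }
}
```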
Eric Schmidt, product manager at Google, describes the Dataflow offering in two ways: “a collection of SDKs for building batch or streaming parallelized data processing pipelines and a fully managed service for executing optimized parallelized data processing pipelines”.
Image Source: http://googlecloudplatform.blogspot.ca/2015/08/Announcing-General-Availability-of-Google-Cloud-Dataflow-and-Cloud-Pub-Sub.html
Organizations can use the Cloud Dataflow service as a method to ingest, transform and analyze data before distributing it to other analytic services and platforms. These integration points include Google BigQuery, Cloud Datastore and Cloud Pub/Sub messaging, as well as third-party analytic services.
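A typical hand-off reads messages from Cloud Pub/Sub, reshapes them and streams the results into BigQuery. The sketch below assumes the 1.x Java SDK; the project, topic and table names are placeholders:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

import java.util.Collections;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);  // unbounded source, so run in streaming mode
    Pipeline p = Pipeline.create(options);

    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("message").setType("STRING")));

    p.apply(PubsubIO.Read.topic("projects/my-project/topics/events"))  // placeholder topic
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(new TableRow().set("message", c.element()));
       }
     }))
     .apply(BigQueryIO.Write.to("my-project:my_dataset.events")        // placeholder table
                            .withSchema(schema));

    p.run();
  }
}
```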
Google has on-boarded partners including SpringML, Cloudera, data Artisans and Salesforce.com to extend the platform offering. For example, customers using Cloud Dataflow and Salesforce.com Wave Analytics will be able to analyze large amounts of data, regardless of its origin, using an end-to-end platform in order to optimize customer interactions.
For organizations preferring to build their own solutions, the open source SDK provides a specialized collection class called PCollection, which represents both bounded and unbounded data collections. Google claims that a PCollection can reach a “virtually unlimited size” and, when combined with a PTransform, can be used to transform data between source and destination systems. The SDK’s I/O APIs support different formats, including text files, Avro files and BigQuery tables, which can be used to load data into a PCollection.
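As a sketch of those I/O APIs, assuming the 1.x Java SDK, each read below yields a PCollection; the paths, schema and table name are hypothetical:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class LoadingPCollections {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Text files: each line becomes one String element.
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/logs/*.txt"));          // placeholder path

    // Avro files: records are decoded using the supplied schema.
    Schema avroSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\","
        + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");        // placeholder schema
    PCollection<GenericRecord> records =
        p.apply(AvroIO.Read.from("gs://my-bucket/data/*.avro").withSchema(avroSchema));

    // BigQuery tables: each row arrives as a TableRow.
    PCollection<TableRow> rows =
        p.apply(BigQueryIO.Read.from("my-project:my_dataset.my_table")); // placeholder table

    p.run();
  }
}
```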
Google Dataflow will be billed on a per-job basis, accounting for the graph of computations provided by the developer, service time, work time and shuffled bytes. In addition to these costs, other Google services consumed by the job, such as BigQuery, will be billed separately.
Google has some competition in this space, including the likes of Amazon and Microsoft. Amazon offers its Kinesis platform, while Microsoft addresses similar use cases with its Azure Data Factory platform.