Google Cloud Dataflow is a simple, flexible, and powerful system you can use to perform data processing tasks of any size.
Cloud Dataflow consists of two major components:
- A set of SDKs that you use to define data processing jobs. The Dataflow SDKs feature a unique programming model that simplifies the mechanics of large-scale cloud data processing. You can define your data processing jobs by writing programs using the Dataflow SDKs, such as the Dataflow SDK for Java.
- A Google Cloud Platform managed service. The Dataflow service ties together and fully manages several different Google Cloud Platform technologies, such as Google Compute Engine, Google Cloud Storage, and BigQuery to execute data processing jobs on Google Cloud Platform resources.
You can use both aspects of Dataflow together, using the Dataflow SDKs to create jobs that the Dataflow service then executes. The Dataflow service also plans to support third-party and open source SDKs.
Common Dataflow Use Cases
High Volume Computation: High volume computation tasks include:
- Tasks that need to process a majority of the bytes in the input data.
- Tasks that generate large amounts of output data, instead of a few small summary values.
- Tasks where the computation requires significant amounts of CPU time.
- Tasks where the computation requires complex custom code.
- Tasks where your input data will not fit into memory on a cost-effective cluster.
An example of a high volume computation task might be producing resized and recompressed images for photos uploaded by users. Every byte of the input data would need to be read and analyzed, and multiple output images might be produced in different sizes and formats for different purposes, leading to a large volume of output data. The resizing and recompression operations are complex, and would be difficult to express in a programming language or model that is not general purpose or that can't take advantage of existing code libraries.
Workflow Synthesis: Rarely does the real world present us with a data processing task so simple that we can perform it in one or two easy to understand steps. Instead, we spend our time trying to manage highly complex workflows with many interconnected steps, with the output of one step becoming the input of the next step. Cloud Dataflow allows you to express these complex workflows using a straightforward programming model, then optimizes these workflows and executes them on a fully managed service, so you can focus on what you want to process, not how it should be done.
At this time, Cloud Dataflow can use Google Cloud Storage and Google BigQuery as input sources and output sinks, and workflow steps can be expressed using the Java SDK. In the future, you'll be able to integrate external steps, such as a Hadoop job, into your workflows, allowing you to start with what you have working today, but grow and manage your workflows using Dataflow.
Extract, Transform, Load (ETL): ETL is shorthand for the process of Extracting data from one or several sources, Transforming it into a more desirable format for analysis, and Loading it into a system where it can be analyzed effectively. As mentioned above, Dataflow excels at this sort of high volume data processing.
You need an ETL pipeline if:
- Your data is not in the format you want it.
- Your data is not where you need it to be.
ETL pipelines are often long and complex, with many steps, sometimes using proprietary tools to extract data from a variety of data sources; this is exactly what Workflow Synthesis, discussed above, is all about. So ETL is really a specific example of an application that embodies both High Volume Computation and Workflow Synthesis.
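To make this concrete, here is a minimal sketch of an ETL-shaped pipeline written with the Dataflow SDK for Java: it extracts raw log lines from Cloud Storage, transforms each line into a structured row, and loads the rows into a BigQuery table. The bucket, project, dataset, table, and field names are hypothetical placeholders, and the input is assumed to be simple comma-separated lines.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

import java.util.Arrays;

public class LogsToBigQueryEtl {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Schema for the destination table (hypothetical field names).
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("user").setType("STRING"),
        new TableFieldSchema().setName("bytes").setType("INTEGER")));

    p.apply(TextIO.Read.from("gs://my-bucket/logs/*.log"))      // Extract: raw log lines
     .apply(ParDo.of(new DoFn<String, TableRow>() {             // Transform: parse each line
        @Override
        public void processElement(ProcessContext c) {
          String[] fields = c.element().split(",");
          c.output(new TableRow()
              .set("user", fields[0])
              .set("bytes", Long.parseLong(fields[1])));
        }
      }))
     .apply(BigQueryIO.Write.to("my-project:my_dataset.usage")  // Load: write rows to BigQuery
         .withSchema(schema));

    p.run();
  }
}
```

Each ETL stage maps onto a pipeline step, which is what lets the Dataflow service optimize and parallelize the whole workflow as a single job.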
Using the Dataflow SDKs
The Dataflow SDKs provide a simple and elegant programming model to express your data processing jobs. You use a Dataflow SDK to create a data processing pipeline. A pipeline is an independent entity that reads some input data, performs some transforms on that data to gain useful or actionable intelligence about it, and produces some resulting output data. A pipeline's transforms might include filtering, grouping, comparing, or joining data.
The Dataflow SDKs provide several useful abstractions that allow you to think about your data processing pipeline in a simple, logical way. Dataflow simplifies the mechanics of large-scale parallel data processing, freeing you from the need to manage orchestration details such as partitioning your data and coordinating individual workers.
Dataflow SDK Concepts
Key concepts in the Dataflow SDKs include the following:
- Simple data representation. Dataflow SDKs use a specialized collection class, called PCollection, to represent your pipeline data. This class can represent data sets of virtually unlimited size, including bounded and unbounded data collections.
- Powerful data transforms. Dataflow SDKs provide several core data transforms that you can apply to your data. These transforms, called PTransforms, are generic frameworks that apply functions that you provide across an entire data set, using the features of the Dataflow service to execute each transform in the most efficient way.
- I/O APIs for a variety of data formats. Dataflow SDKs provide APIs that let your pipeline read and write data to and from a variety of formats and storage technologies. Your pipeline can read text files, Avro files, BigQuery tables, and more.
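As a rough illustration of how these concepts fit together, the following sketch uses the Dataflow SDK for Java (the Cloud Storage paths are hypothetical) to build a Pipeline that reads text into a PCollection, applies a ParDo PTransform with a user-provided function, and writes the results back out:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class MinimalPipeline {
  public static void main(String[] args) {
    // A Pipeline is the independent entity that ties inputs, transforms, and outputs together.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A PCollection represents the data flowing through the pipeline, bounded or unbounded.
    PCollection<String> lines =
        p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"));

    // ParDo is a core PTransform that applies a function you provide to every element.
    PCollection<String> upperCased = lines.apply(
        ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            c.output(c.element().toUpperCase());
          }
        }));

    // The I/O APIs write the resulting PCollection back out as text files.
    upperCased.apply(TextIO.Write.to("gs://my-bucket/output/result"));

    p.run();
  }
}
```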
Open Source SDKs
Google has released the Dataflow Java SDK as open source, available on GitHub. This will help foster an ecosystem of open source projects around the Dataflow model, through the following benefits:
- Having the source code of the Dataflow Java SDK available provides visibility into exactly how your programs interact with the managed Dataflow service offered by Google.
- Open sourcing the Dataflow Java SDK makes it easier for others to provide additional runners for the Dataflow Java SDK that target different runtime environments, enabling Dataflow pipelines to be run on premises or on non-Google cloud services.
- Others will be able to learn from the Dataflow transforms provided as part of the Dataflow Java SDK, making it easier for them to write and release new Dataflow transforms into the open source community.
What is the Dataflow Service?
The Dataflow service fully manages Google Cloud Platform resources to execute your data processing tasks. The Dataflow service ties together a number of Cloud Platform technologies, including:
- Google Compute Engine VMs, to provide job workers.
- Google Cloud Storage, for reading and writing data.
- Google BigQuery, for reading and writing data.
When using the Dataflow service, there's no need to shard or partition your data by hand; the service's automatic optimization and resource management systems break your job down for the most efficient execution on the Cloud Platform resources you've allocated.
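For example, with the Dataflow SDK for Java, a pipeline is handed to the managed service by setting a few pipeline options; the project ID and staging bucket below are placeholders:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

public class RunOnDataflowService {
  public static void main(String[] args) {
    // Options that tell the SDK to submit the job to the managed Dataflow service.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setProject("my-project");                      // placeholder Cloud project ID
    options.setStagingLocation("gs://my-bucket/staging");  // where the SDK stages job files for workers
    options.setRunner(DataflowPipelineRunner.class);       // execute on the Dataflow service

    Pipeline p = Pipeline.create(options);
    // ... apply transforms here ...
    p.run();
  }
}
```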
Service Features
Dynamic Optimization: The Dataflow service provides dynamic optimization of Cloud Platform resources to execute your data processing jobs. When you submit a job, the Dataflow service constructs a directed graph of its processing steps and optimizes that graph for the most efficient execution.
Resource Management: The Dataflow service fully manages Cloud Platform technologies to run your job. This includes spinning up and tearing down Compute Engine resources, collecting logs, and communicating with Cloud Storage technologies.
Job Monitoring: The Dataflow service includes a monitoring interface built into the Google Developers Console. The Dataflow monitoring interface shows the different stages of your data processing pipeline and lets you see how data moves through those stages as the job progresses.
Native I/O Adapters for Cloud Storage Technologies: The Dataflow service has built-in support for reading data from, and writing data to, Cloud Platform storage systems such as Cloud Storage and BigQuery. This makes it easy to build a data processing pipeline that works with your data in Cloud Platform.
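As a small sketch of the built-in BigQuery adapter in the Dataflow SDK for Java (the table reference below is a placeholder), a pipeline can read a table directly into a collection of rows:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadFromBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The native adapter returns each row of the table as a TableRow.
    PCollection<TableRow> rows =
        p.apply(BigQueryIO.Read.from("my-project:my_dataset.my_table"));

    // ... further transforms would consume 'rows' here ...
    p.run();
  }
}
```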