Cloud Dataflow Beta
Build, deploy, and run data processing pipelines that scale to solve your key business challenges. Google Cloud Dataflow enables reliable execution for large-scale data processing scenarios such as ETL, analytics, real-time computation, and process orchestration.
Features
Unified programming model
Cloud Dataflow provides unified programming primitives for both batch and stream-based data analysis. Powerful windowing semantics enable intuitive temporal processing patterns that address a wide range of data processing scenarios, like session analysis, anomaly detection, and funnel analysis.
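To make the session analysis scenario concrete, here is a minimal sketch of gap-based session windowing in plain Java. It does not use the Cloud Dataflow SDK; the class name `SessionWindows`, the method `sessionize`, and the 30-second gap are hypothetical and chosen only for illustration. Events closer together than the gap fall into the same session, and a longer pause starts a new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java, not the Cloud Dataflow SDK API.
public class SessionWindows {

    /**
     * Groups event timestamps (in milliseconds) into sessions. A new session
     * starts whenever the gap since the previous event exceeds gapMillis.
     */
    public static List<List<Long>> sessionize(long[] timestamps, long gapMillis) {
        // Sort a copy so the caller's array is left untouched.
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);

        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            boolean gapExceeded = !current.isEmpty()
                    && t - current.get(current.size() - 1) > gapMillis;
            if (gapExceeded) {
                // Close the current session and open a new one.
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Hypothetical click timestamps: three early events, then two after a long pause.
        long[] events = {0L, 1_000L, 2_500L, 60_000L, 61_000L};
        System.out.println(sessionize(events, 30_000L));
        // prints [[0, 1000, 2500], [60000, 61000]]
    }
}
```

In a managed pipeline the same grouping would be expressed declaratively and applied per key, for batch and streaming input alike; this sketch only shows the grouping rule itself on a small in-memory array.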
Managed scaling
As a managed service, Cloud Dataflow manages the full lifecycle of the compute resources a pipeline requires, reducing the burden of resource management and cluster operations. Cloud Dataflow can horizontally auto-scale compute resources to reach the needed throughput and can automatically re-partition work to optimize resource utilization.
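The idea behind horizontal scaling to a throughput target can be sketched with simple arithmetic. The following is not the service's actual scaling algorithm, which this page does not describe; the class `AutoscaleSketch`, the method `workersNeeded`, and all the numbers are assumptions made up for illustration. It divides a target throughput by an assumed per-worker throughput, rounds up, and clamps the result to a configured range.

```java
// Illustrative only: a hypothetical sizing rule, not Cloud Dataflow's autoscaler.
public class AutoscaleSketch {

    /**
     * Returns the number of workers needed to sustain targetPerSec, assuming
     * each worker handles perWorkerPerSec, clamped to [minWorkers, maxWorkers].
     */
    public static int workersNeeded(double targetPerSec, double perWorkerPerSec,
                                    int minWorkers, int maxWorkers) {
        if (perWorkerPerSec <= 0) {
            throw new IllegalArgumentException("perWorkerPerSec must be positive");
        }
        if (minWorkers > maxWorkers) {
            throw new IllegalArgumentException("minWorkers must not exceed maxWorkers");
        }
        // Round up so the fleet meets or exceeds the target throughput.
        int needed = (int) Math.ceil(targetPerSec / perWorkerPerSec);
        return Math.max(minWorkers, Math.min(maxWorkers, needed));
    }

    public static void main(String[] args) {
        // 10,000 elements/s at an assumed 1,500 elements/s per worker.
        System.out.println(workersNeeded(10_000, 1_500, 1, 20)); // prints 7
        // Demand beyond the configured ceiling is capped at maxWorkers.
        System.out.println(workersNeeded(100_000, 1_500, 1, 20)); // prints 20
    }
}
```

A managed service makes this kind of decision continuously and without operator input, which is the operational work the paragraph above says is taken off the developer.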
Reliable & consistent processing
Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern, or pipeline complexity. Developers can focus on writing business logic instead of handling control plane exceptions from hardware and network failures, or tuning execution to accommodate inputs.
Open source
Google has made the Java-based Cloud Dataflow SDK available as open source. This SDK allows the Cloud Dataflow programming model to be widely used, so that all developers can benefit from the productivity of writing simple and extensible data processing pipelines that can describe both stream and batch processing tasks.
Built for the cloud
From the ground up, Cloud Dataflow is built on and for the cloud. Cloud Dataflow workers run on stock Google Compute Engine instances, providing developers with an operationally familiar and cost-effective compute environment. Cloud Dataflow integrates with Cloud Storage, Cloud Pub/Sub and BigQuery for seamless data processing.
Monitoring
Integrated into the Google Developers Console, Cloud Dataflow provides statistics such as pipeline throughput and lag, as well as worker log inspection, all in near real time. The monitoring console mirrors the processing logic of the pipeline, enabling developers to easily understand pipeline execution.