
Features

Unified programming model

Cloud Dataflow provides a single set of programming primitives for both batch and streaming data analysis. Powerful windowing semantics enable intuitive temporal processing patterns that address a wide range of scenarios, such as session analysis, anomaly detection, and funnel analysis.
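
For illustration, the following is a minimal sketch of a windowed pipeline written against the open-source Java SDK described later on this page. It counts elements per one-minute fixed window over an unbounded Pub/Sub stream; the project, topic names, and window size are hypothetical choices for the example, and the same pipeline shape applies unchanged to a bounded batch source.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import org.joda.time.Duration;

public class WindowedCounts {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);  // run as an unbounded (streaming) pipeline

    Pipeline p = Pipeline.create(options);

    p.apply(PubsubIO.Read.topic("projects/my-project/topics/events"))   // hypothetical topic
     // Assign each element to a one-minute fixed window based on its timestamp.
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
     // Count occurrences of each element within its window.
     .apply(Count.<String>perElement())
     // Format each windowed count as a string for output.
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(PubsubIO.Write.topic("projects/my-project/topics/counts")); // hypothetical topic

    p.run();
  }
}
```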

Managed scaling

As a managed service, Cloud Dataflow handles the full lifecycle of the compute resources a job requires, reducing the burden of resource management and cluster operations. It can horizontally autoscale workers to reach the required throughput and can automatically re-shard work to keep resources fully utilized.
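
Scaling behavior can be bounded when a job is submitted. The snippet below is a sketch assuming the DataflowPipelineWorkerPoolOptions interface in the Java SDK; the worker ceiling of 20 is a hypothetical limit chosen only for illustration.

```java
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class AutoscalingOptionsExample {
  public static void main(String[] args) {
    DataflowPipelineWorkerPoolOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineWorkerPoolOptions.class);

    // Let the service scale the worker pool based on observed throughput,
    // up to a ceiling of 20 workers (hypothetical limit for this example).
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(20);
  }
}
```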

Reliable & consistent processing

Cloud Dataflow provides built-in support for fault-tolerant execution that remains consistent and correct regardless of data size, cluster size, processing pattern, or pipeline complexity. Developers can focus on writing business logic instead of handling failures caused by hardware or network problems, or hand-tuning input data sizes.

Open source

Google has open-sourced the Java-based Cloud Dataflow SDK. The SDK makes the Cloud Dataflow programming model broadly available, so any developer can benefit from the productivity of writing simple, extensible data processing pipelines that describe both streaming and batch tasks.

Built for the cloud

Cloud Dataflow is built on and for the cloud from the ground up. Worker resources run on stock Google Compute Engine instances, giving developers a familiar and cost-effective compute environment to operate. Cloud Dataflow integrates with Cloud Storage, Cloud Pub/Sub, and BigQuery for seamless data processing.
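
As a sketch of how those integrations compose, the pipeline below reads text files from Cloud Storage and appends each line to a BigQuery table using the Java SDK's TextIO and BigQueryIO connectors. The bucket path, project, dataset, table, and single-column schema are hypothetical placeholders.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import java.util.Arrays;

public class GcsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Single-column schema for the destination table.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("line").setType("STRING")));

    p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"))        // hypothetical bucket
     // Wrap each text line in a BigQuery TableRow.
     .apply(ParDo.of(new DoFn<String, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(new TableRow().set("line", c.element()));
       }
     }))
     .apply(BigQueryIO.Write
         .to("my-project:my_dataset.lines")                        // hypothetical table
         .withSchema(schema)
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```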

Monitoring

Integrated into the Google Cloud Developers Console, Cloud Dataflow monitoring provides job lifecycle statistics, including in-flight information such as real-time pipeline throughput and per-step lag, along with real-time worker log inspection. The monitoring console mirrors the processing logic of the pipeline, making it easy for developers to understand how a pipeline is executing.