As champions for Apache Flink, we are excited for the 2.0 release and all that it will bring. Apache Flink 1.0 was released in 2016, and while we don’t have an exact release date, it looks like 2.0 will be released in late 2024/early 2025. Version 1.2 was just released in August 2024. Version 2.0 is set to be a major milestone release, marking a significant evolution in the stream processing framework. This blog runs down some of the key features and changes coming in Flink 2.0.

Disaggregated State Storage and Management

One of the most exciting features of Flink 2.0 is the introduction of disaggregated state storage and management. It will utilize a Distributed File System (DFS) as the primary storage for state data. This architecture separates compute and storage resources, addressing key scalability and performance needs for large-scale, cloud-native data processing.

Core Advantages of Disaggregated State Storage

  1. Improved Scalability
    By decoupling storage from compute resources, Flink can manage massive datasets—into the hundreds of terabytes—without being constrained by local storage. This separation enables efficient scaling in containerized and cloud environments.
  2. Enhanced Recovery and Rescaling
    The new architecture supports faster state recovery on job restarts, efficient fault tolerance, and quicker job rescaling with minimal downtime. Key components include shareable checkpoints and LazyRestore for on-demand state recovery.
  3. Optimized I/O Performance
    Flink 2.0 uses asynchronous execution and grouped remote state access to minimize the latency impact of remote storage. A hybrid caching mechanism can improve cache efficiency, providing up to 80% better throughput than traditional file-level caching.
  4. Improved Batch Processing
    Disaggregated state storage enhances batch processing by better handling large state data and integrating batch and stream processing tasks, making Flink more versatile across diverse workloads.
  5. Dynamic Resource Management
    The architecture enables flexible resource allocation, minimizing CPU and network usage spikes during maintenance tasks like compaction and cleanup.

API and Configuration Changes

Several API and configuration changes will be introduced, including:

  • Removal of deprecated APIs, including the DataSet API and Scala versions of DataStream and DataSet APIs
  • Deprecation of the legacy SinkFunction API in favor of the Unified Sink API
  • Overhaul of the configuration layer, enhancing user-friendliness and maintainability
  • Introduction of new abstractions such as Materialized Tables in v1.2 and further enhanced in v2
  • Updates to configuration options, including proper type usage (e.g., Duration, Enum, Int)

Modernization and Unification

Flink 2.0 aims to further unify batch and stream processing:

  • Modernization of legacy components, such as replacing the legacy SinkFunction with the new Unified Sink API
  • Enhanced features that combine batch and stream processing seamlessly
  • Improvements to Adaptive Batch Execution for optimizing logical and physical plans

Performance Improvements

The community is working on making Flink's performance on bounded streams (batch use cases) competitive with dedicated batch processors. This can further simplify your data processing stack.

  • Dynamic Partition Pruning (DPP) to minimize I/O costs
  • Runtime Filter to reduce I/O and shuffle costs
  • Operator Fusion CodeGen to improve query execution performance

Cloud-Native Focus

Flink 2.0 is being designed with cloud-native architectures in mind:

  • Improved efficiency in containerized environments
  • Better scalability for large state sizes
  • More efficient fault tolerance and faster rescaling

This is an exciting time for Apache Flink 2.0. It represents a significant leap forward in unified batch and stream processing, focusing on cloud-native architectures, improved performance, and streamlined APIs. These changes aim to address the evolving needs of data-driven applications and set new standards for what's possible in data processing. DeltaStream is proudly powered by Apache Flink, which makes it easy to start running Flink in minutes. Get a free trial of DeltaStream and see for yourself.