12 Jul 2022 Zhipeng Zhang & Dong Lin
The Apache Flink community is excited to announce the release of Flink ML 2.1.0! This release focuses on improving Flink ML’s infrastructure, such as Python SDK, memory management, and benchmark framework, to facilitate the development of performant, memory-safe, and easy-to-use algorithm libraries. We validated the enhanced infrastructure by implementing, benchmarking, and optimizing 10 new algorithms in Flink ML, and confirmed that Flink ML can meet or exceed the performance of selected algorithms from alternative popular ML libraries. In addition, this release added example Python and Java programs for each algorithm in the library to help users learn and use Flink ML.
With the improvements and performance benchmarks made in this release, we believe Flink ML’s infrastructure is ready for use by the interested developers in the community to build performant pythonic machine learning libraries.
We encourage you to download the release and share your feedback with the community through the Flink mailing lists or JIRA! We hope you like the new release and we’d be eager to learn about your experience with it.
API and Infrastructure
Supporting fine-grained per-operator memory management
Before this release, algorithm operators with internal states (e.g. the training data to be replayed for each round of iteration) store state data using the state-backend API (e.g. ListState). Such an operator either needs to store all data in memory, which risks OOM, or it needs to always store data on disk. In the latter case, it needs to read and de-serialize all data from disks repeatedly in each round of iteration even if the data can fit in RAM, leading to sub-optimal performance when the training data size is small. This makes it hard for developers to write performant and memory-safe operators.
This release enhances the Flink ML infrastructure with the mechanism to specify the amount of managed memory that an operator can consume. This allows algorithm operators to write and read data from managed memory when the data size is below the quota, and automatically spill those data that exceeds the memory quota to disks to avoid OOM. Algorithm developers can use this mechanism to achieve optimal algorithm performance as input data size varies. Please feel free to check out the implementation of the KMeans operator for example.
Improved infrastructure for developing online learning algorithms
A key objective of Flink ML is to facilitate the development of online learning applications. In the last release, we enhanced the Flink ML API with setModelData() and getModelData(), which allows users of online learning algorithms to transmit and persist model data as unbounded data streams. This release continues the effort by improving and validating the infrastructure needed to develop online learning algorithms.
Specifically, this release added two online learning algorithm prototypes (i.e. OnlineKMeans and OnlineLogisticRegression) with tests covering the entire lifecycle of using these algorithms. These two algorithms introduce concepts such as global batch size and model version, together with metrics and APIs to set and get those values. While the online algorithm prototypes have not been optimized for prediction accuracy yet, this line of work is an important step toward setting up best practices for building online learning algorithms in Flink ML. We hope more contributors from the community can join this effort.
Algorithm benchmark framework
An easy-to-use benchmark framework is critical to developing and maintaining performant algorithm libraries in Flink ML. This release added a benchmark framework that provides APIs to write pluggable and reusable data generators, takes benchmark configuration in JSON format, and outputs benchmark results in JSON format to enable custom analysis. An off-the-shelf script is provided to visualize benchmark results using Matplotlib. Feel free to check out this README for instructions on how to use this benchmark framework.
The benchmark framework currently supports evaluating algorithm throughput. In the future release, we plan to support evaluating algorithm latency and accuracy.
This release enhances the Python SDK so that operators in the Flink ML Python library can invoke the corresponding operators in the Java library. The Python operator is a thin-wrapper around the Java operator and delivers the same performance as the Java operator during execution. This capability significantly improves developer velocity by allowing algorithm developers to maintain both the Python and the Java libraries of algorithms without having to implement those algorithms twice.
This release continues to extend the algorithm library in Flink ML, with the focus on validating the functionalities and the performance of Flink ML infrastructure using representative algorithms in different categories.
Below are the lists of algorithms newly supported in this release, grouped by their categories:
- Feature engineering (MinMaxScaler, StringIndexer, VectorAssembler, StandardScaler, Bucketizer)
- Online learning (OnlineKmeans, OnlineLogisiticRegression)
- Regression (LinearRegression)
- Classification (LinearSVC)
- Evaluation (BinaryClassificationEvaluator)
Example Python and Java programs for these algorithms are provided on the Apache Flink ML website to help users learn and try out Flink ML. And we also provided example benchmark configuration files in the repo for users to validate Flink ML performance. Feel free to check out this README for instructions on how to run those benchmarks.
Please review this note for a list of adjustments to make and issues to check if you plan to upgrade to Flink ML 2.1.0.
This note discusses any critical information about incompatibilities and breaking changes, performance changes, and any other changes that might impact your production deployment of Flink ML.
Flink dependency is changed from 1.14 to 1.15.
This change introduces all the breaking changes listed in the Flink 1.15 release notes.
Release Notes and Resources
Please take a look at the release notes for a detailed list of changes and new features.
List of Contributors
The Apache Flink community would like to thank each one of the contributors that have made this release possible:
Yunfeng Zhou, Zhipeng Zhang, huangxingbo, weibo, Dong Lin, Yun Gao, Jingsong Li and mumuhhh.