Posts

It's Time to Bring Unified Stream-Batch Processing Engines to Mass Adoption

Note: This article was translated from Chinese. Some technical terms and concepts may differ from the original English terminology. ℹ️ The original article was posted on zhihu @ 2023-02-20. This article draws on a decade of personal learning and growth to trace the development and iteration of unified stream-batch processing engines. The author started as an oblivious undergraduate observing the development of big data systems, gradually took part in it, and eventually became a committer in the Apache Flink community. The cognitive journey spiraled upward: starting with MapReduce batch processing; developing machine learning libraries on Spark’s convenient and powerful batch processing capabilities; promoting Spark’s micro-batch-based real-time computing at Microsoft; participating in Flink’s real-time computing development and promotion at Alibaba, moving from offline batch processing to real-time online processing; and, after leaving Alibaba, once again promoting unified stream-batch processing engines within the company. As the elders say: personal struggle is certainly important, but it is also necessary to align with the course of history.

September 14, 2025

It's Time to Conclude the Discussion on Stream-Batch Unification in the Data Warehouse Field - Incremental Data Warehouse Series Part II

ℹ️ This article was originally published on zhihu @ 2024-03-27. 📝 Note: This article was translated from Chinese. Some technical terms and concepts may differ from the original English terminology. Continuing from the previous article (apologies for the delay between articles due to work commitments): Cost Issues of Near Real-Time Offline Data Warehouses - Incremental Data Warehouse Series Part I

September 14, 2025

Cost Issues of Near Real-Time Offline Data Warehouses - Incremental Data Warehouse Series Part I

ℹ️ This article was originally published on zhihu @ 2023-10-07. 📝 Note: This article was translated from Chinese. Some technical terms and concepts may differ from the original English terminology. Demand for Near Real-Time Offline Data Warehouses: The offline data warehouse, especially the Spark + Hive computing and storage architecture, has gone through more than ten years of development and industry validation and has become the de facto industry standard. However, as the industry’s demand for data timeliness has grown, a real-time computing and storage architecture based on Flink plus various types of storage has gradually developed alongside it. Because the two architectures differ in usage scenarios, cost, and data processing accuracy, the Lambda architecture remains in widespread use in the industry to this day. (Interestingly, Hive, Spark, and Flink won the SIGMOD Systems Award in 2018, 2022, and 2023, respectively.)

September 14, 2025

Those Fading Stream Processing Engines

ℹ️ This article was originally published on zhihu @ 2023-03-06. 📝 Note: This article was translated from Chinese. Some technical terms and concepts may differ from the original English terminology. “A generation will eventually grow old, but there are always those who are young.” (“The Train Drives Toward the Clouds, Dreams Rest Peacefully in the Ninth Heaven”)

September 14, 2025