Motivation
Companies like Uber scan petabytes of data every few hours, and doing this in a timely manner has been a challenge for Uber and the broader big data community for years. The Lambda architecture promised a way of processing massive quantities of data (i.e. “Big Data”) by combining batch-processing and stream-processing methods in a hybrid approach.
However, the fundamental tradeoff between data ingest latency, scan performance, compute resources, and operational complexity remained unavoidable. (Many other architectures have tried to address this fundamental problem of big data workloads as well.)
But for workloads that can tolerate latencies of about 10 minutes, there is no need for a separate “speed” serving layer if there is a faster way to ingest and prepare data in HDFS. This unifies the serving layer and significantly reduces overall complexity and resource usage.
Greetings from Apache Hudi
Hudi (Hadoop Upserts Deletes and Incrementals) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS on the order of a few minutes, and chaining incremental processing steps on top of those changes.
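To make this concrete, here is a minimal sketch of what the two capabilities look like through Hudi's Spark datasource: upserting a batch of mutations into a table, then incrementally pulling only the records changed since a given commit. The table name, HDFS paths, and record fields (`trip_id`, `updated_at`) are illustrative assumptions, not part of any specific pipeline.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      // Hudi requires Kryo serialization for Spark
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Assume an incoming batch of mutations (inserts + updates) as a DataFrame
    val updates = spark.read.json("hdfs:///incoming/trips_batch.json") // hypothetical path

    // Apply the mutations in place: Hudi resolves each record against the
    // existing table by record key, keeping the latest version per key
    updates.write
      .format("hudi")
      .option("hoodie.table.name", "trips")                             // target table (assumed)
      .option("hoodie.datasource.write.recordkey.field", "trip_id")     // primary key field (assumed)
      .option("hoodie.datasource.write.precombine.field", "updated_at") // tie-breaker on key collisions
      .option("hoodie.datasource.write.operation", "upsert")            // upsert rather than bulk insert
      .mode(SaveMode.Append)
      .save("hdfs:///tables/trips")                                     // hypothetical base path

    // A downstream job can then read incrementally: only records that
    // changed after the given commit instant, enabling chained processing
    val changed = spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20240101000000") // placeholder commit time
      .load("hdfs:///tables/trips")
    changed.show()
  }
}
```

The incremental read at the end is what allows pipelines to be chained: each downstream job consumes only the delta since its last run, instead of rescanning the whole table.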
