Thursday, 28 November 2019

Is Apache Hudi (Uber Hoodie) a Game Changer for BigData Workloads?

Motivation

Companies like Uber scan petabytes of data every few hours, and getting this done in a timely manner has been a challenge for Uber and the BigData community for years. The Lambda architecture promised a way of processing massive quantities of data (i.e. “Big Data”) by providing access to both batch-processing and stream-processing methods in a hybrid approach.

Figure 1: Lambda architecture requires double compute and double serving.
However, the fundamental tradeoff between data ingest latency, scan performance, compute resources, and operational complexity remained unavoidable. (Note: many other architectures have tried to address this fundamental problem of BigData workloads as well.)

But for workloads that can tolerate latencies of about 10 minutes, there is no need for a separate “speed” serving layer if there is a faster way to ingest and prepare data in HDFS. This unifies the serving layer and reduces the overall complexity and resource usage significantly.

Greetings from Apache Hudi 

Hudi (Hadoop Upserts Deletes and Incrementals) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS on the order of a few minutes and chaining incremental processing jobs on top of it.
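To make this concrete, here is a minimal sketch of what an upsert into a Hudi dataset looks like through the Spark datasource API. The table name, record key field, and HDFS paths below are illustrative assumptions, not details from Uber's setup.

    // Minimal sketch (illustrative names): upsert a batch of changed rows into a Hudi table.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      .getOrCreate()

    // Incoming batch of changed rows; assume it has a unique "id" column and an
    // event-time column "ts" used to keep the latest version of each record.
    val updates = spark.read.json("hdfs:///tmp/trips_incremental.json")

    updates.write
      .format("hudi")                                             // "org.apache.hudi" on older bundles
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "id")    // key used to locate existing records
      .option("hoodie.datasource.write.precombine.field", "ts")   // latest "ts" wins on duplicate keys
      .option("hoodie.datasource.write.operation", "upsert")      // mutate in place instead of appending
      .mode(SaveMode.Append)
      .save("hdfs:///data/hudi/trips")

Instead of rewriting whole partitions, Hudi locates the affected files via the record key and applies only the changed rows, which is what brings ingest latency down to minutes.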

