Paper review: Hive and Pig : umbrant

Paper review: Hive and Pig

October 9, 2011

This is a paper review of "Hive: Data Warehousing & Analytics on Hadoop" by the Facebook Data Team (a set of slides), and "Pig Latin" A Not-So-Foreign Language for Data Processing" by Olston et al. The Hive slide deck I believe is from 2009, and Pig was published at SIGMOD in 2008. I supplemented this with the Hive paper published at VLDB in 2009.

These are Facebook and Yahoo's approaches to higher-level languages that compile down to MapReduce on Hadoop. Measured by the percentage of Hive and Pig jobs on their production clusters, they have both been extremely successful. Hive takes a traditional SQL/database-like approach, while Pig looks more imperative. At face value they seem quite different, but there are actually a bunch of underlying similarities.

Hive main ideas

Hive is effectively a traditional database that just uses HDFS and MapReduce for data storage and query execution. Tables are serialized and deserialized to files in HDFS, and can be partitioned across and within fields. The query language, HiveQL, looks exactly like SQL minus some of the more complicated operators because of engineering effort and the limitations of MapReduce. UDFs are also supported, meaning that normal MapReduce code can be slid right in. HiveQL is compiled down into a MR query plan which can consist of multiple MR jobs. The logical plan is optimized by a rule-based optimizer (future work being an adaptive cost-based optimizer).

Queries can be fed to the server via a Thrift server, which enables Hive usage from a variety of different programming languages. A small note is that table metadata is stored outside of HDFS, in a normal database. This is simply because the amount of data is small, and the access pattern is pretty random, making HDFS ill-suited.

Pig main ideas

Pig is designed explicitly for ad-hoc data analysis by programmers. The query language looks like Python with operators pulled from SQL, and instead of tables, users are given more programmer-friendly data structures like maps and lists. UDFs are also first-class citizens in Pig, and can have arbitrary inputs and outputs (non-atomic values).

All this means that the query language and data format are more flexible. Hive needs to the classic ETL (extract-transform-load) to get data into tables before it can query it. Pig, you just pass it a file and a function explaining how to interpret it. This likely comes at a performance cost, but using the standard deserializers and a schema would ameliorate this. Pig also allows for more explicit control over the query plan, since each stage in the execution DAG is as programmed.

As in Hive, Pig does not provide some operators because of the limitations of MapReduce. Also as in Hive, statements using the SQL-like operators can be optimized, and multiple MapReduce jobs are chained together for you automatically.

One thing I really like about Pig is the focus on debugging. Trying to reason about a page of SQL is really difficult, and it's much easier to reason about Pig's series of steps. It looks extremely similar to how I do ad-hoc text parsing in Python: gradually applying operators to collections of strings until I get the result I want. Pig also provides an "example execution table" that shows what the Pig program does on a small amount of data, which is much quicker than running the actual MapReduce jobs.

Analysis

It's handy that both Hive and Pig automatically string together MR jobs as part of one program, but you still pay the serialization overhead of writing things into intermediate files between jobs. This is something that isn't true with a more general execution framework like Dryad. The move towards more declarative languages, as I've said previously, isn't surprising at all, since actually programming a MapReduce job is way more work than using something more high-level and declarative. For ad-hoc queries, it's way better to optimize for programmer productivity than try to squeeze out that last 20% of performance from writing it in raw MR.

Hive has been extremely popular at Facebook, and I think the same is true of Pig at Yahoo. I think the future is going to be improving the underlying Hadoop execution engine to better support ad-hoc queries by keeping intermediate files in memory, and improving the number of operators and optimizers for both languages.