Article

6 Things I Wish I Knew About Hive 2.0 before Starting My Job at LinkedIn

I recently joined the Hive team at LinkedIn, where I work on extending SQL support to Hive and improving query performance. Prior to joining the team, I worked for six years in a research lab storing massive datasets in a Hadoop cluster managed by a relational database. Since my transition from academia to industry, much of what I learned about running large-scale applications has changed — and so has much of how we approach problems when building software or systems that handle data at scale.

In this post, I’ll share some of the things that most surprised me in my first few months.

Linkdin

6 Things I Wish I Knew About Hive 2.0 before Starting My Job at LinkedIn

1. You can’t simply run a Hive query and inspect the results — you have to benchmark!

Running a query on a Hadoop cluster is noticeably different than running queries on an in-memory database such as Vertical or Parcel. The difference is mostly due to disk access, so when profiling your queries with EXPLAIN ANALYZE, you should be aware of whether your data fits entirely in memory (which will result in very fast queries) or not (in which case, expect slower performance). Also keep in mind that every time you add hardware to your cluster, it will take time for new nodes to catch up with old ones — that is, the newly added machines may perform worse than the rest of your cluster until they have caught up with it.

2. NoSQL databases can support ACID transactions.

While Hadoop is often used as a batch processing system for large-scale ETL pipelines or machine learning pipelines, many new systems such as Phoenix and Borealis support online operations where data engineers write Avro schemas and Hive queries to update existing datasets in place (a technique known as Incremental Streaming). These systems will be a critical part of LinkedIn’s infrastructure going forward since they allow us to enrich our stream of real-time updates from Streams with historic data from Kafka without having to reprocess all of our historic events again on every run.

3. Optimizing Hive queries means running them 100x first to test their performance impact.

Prior to working at LinkedIn, I was used to writing ad hoc queries in Hive that examined datasets of 100 GB or larger on Hadoop clusters containing tens of nodes. With my role now shifted towards performance engineering, I use the same queries against much larger datasets (tens of TBs) or even multiple times concurrently against production data. The trick knows what queries take more than 5 minutes so you can benchmark them before submitting them as part-time jobs on your cluster (and remember to time how long they take after every modification!). Also, keep in mind that you will need to wait for several hours before the query completes if there are any mistakes — and that it will only tell you the final result, not how many nodes were involved in processing your data.

4. Hive supports advanced SQL features such as window functions and nested queries.

I was surprised to see how common user questions on Hive mailing lists involve people writing multiple-stage queries using subqueries or correlated subqueries (e.g., select * from table where id = 1 group by dept) and their performance impact (see above). I also think it’s interesting that we’ve started building out new functionality like window functions into our branch of Hive (details here).

5. You can’t run MapReduce jobs without having code for them… but you don’t need a Ph.D. to use it.

I can’t count the number of times where I’ve seen queries submitted to Hadoop that are over 100 lines long, or queries that use one query per line at a terminal on a local machine (e.g., grep regex foo.*\.log | wc -l). The worst is when someone writes Java code for their MapReduce job inline in their Hive script — and then wonders why it doesn’t work after they submit it! My favorite feature of writing MapReduce jobs is that we deploy our version of Hadoop with Maven plugins and provide a shell script for people to run them. You just need to define your map and reduce methods in order to get started!

6. NoSQL databases can store nested data.

While a relational database is organized as tables, NoSQL databases may also support the storage of embedded documents or arrays inside records (or rows). In this example, we see an online advertising click event with three fields: user_id, ad_id, and price. When converted to JSON, you can see that each field has additional subfields, just like a record in a relational table might have a non-null foreign key. Using similar techniques for working with Avro schemas and Hive queries for Kafka topics means it’s easy to transform your event stream into aggregate reports without any preprocessing.

Conclusion:

I hope these examples show that there’s an easy way to get started with each of the Big Data tool chains — and you don’t need a Ph.D. (or many years experience) if you follow best practices! For me, I’ve been at LinkedIn for almost three years now and it’s been very rewarding to see our company grow out our infrastructure for using Kafka as part of our real-time data platform. Thanks again to all the contributors who submitted patches for this blog post (and this one too!).