Even though Hive is designed to look like a SQL engine, it does not actually work like most other SQL engines. So it is very important to understand Hive's architecture in order to use it efficiently. Let's discuss some best practices to get the maximum performance benefit from Hive.
Use of Partitions
Earlier, database experts used to create a table per day, meaning a complete new table for every single day. Though this practice needs a lot of maintenance, people still use it in real-world scenarios. In Hive, however, we have the concept of partitions, which helps in maintaining a new partition for every new day without much effort. As we have already seen, for each partition Hive creates a new folder and stores the records accordingly, which helps in targeting only the required area and avoids scanning the complete table.
To use day-wise partitions, we can create the table like this:
CREATE TABLE page_view (
  user_id INT,
  session_id BIGINT,
  url STRING
)
PARTITIONED BY (day STRING);
This will create a day-wise folder under the page_view directory on HDFS, for example /user/hive/warehouse/page_view/day=20140101.
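As a sketch (assuming the page_view table above and a hypothetical date value), a query that filters on the partition column lets Hive read only that partition's folder instead of scanning the whole table:

```sql
-- Only the folder day=20140101 is read; all other partitions are skipped.
SELECT url, COUNT(*) AS views
FROM page_view
WHERE day = '20140101'
GROUP BY url;
```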
Even though there are various benefits to partitioning a table, as they say, anything in excess is dangerous. When Hive creates a partition for a table, it has to maintain extra metadata to redirect queries to the right partition. So if a table ends up with too many partitions, it becomes difficult for Hive to handle them. It is therefore very important to understand the data growth and the kind of data we are going to get, so that we can plan our schemas accordingly. It is also important to select the right columns for partitioning only after completely understanding the kind of queries we want to run on that data, as a partition that is beneficial for some queries could badly affect the performance of others. And as we know, HDFS works best with a smaller set of large files rather than a larger set of small files.
Unlike other SQL engines, Hive has no primary keys or foreign keys, as it is not meant to run complex relational queries; it is meant to get at data in an easy and efficient manner. So while designing a Hive schema, we don't need to bother about selecting a unique key, and we don't need to normalize the data for efficiency.
By keeping denormalized data, we avoid the multiple disk seeks that are typical when we have foreign-key relations. By avoiding those extra I/O operations, we gain a performance benefit.
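As a sketch of what this means in practice (the user columns here are hypothetical, not from the original schema), a denormalized page-views table folds user attributes into the fact table so queries need no join against a separate users table:

```sql
-- Denormalized: user attributes stored alongside each page view,
-- avoiding a join (and its extra disk seeks) at query time.
CREATE TABLE page_views_denorm (
  user_id    INT,
  user_name  STRING,   -- hypothetical user attribute
  user_city  STRING,   -- hypothetical user attribute
  session_id BIGINT,
  url        STRING
)
PARTITIONED BY (day STRING);
```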
Efficient Use of Single Scan
As we all know, Hive does a complete table scan in order to process a query. So it is recommended to use that single scan to perform multiple operations. Take a look at the following queries:
INSERT INTO page_views_20140101
SELECT * FROM page_views WHERE date='20140101';
INSERT INTO page_views_20140102
SELECT * FROM page_views WHERE date='20140102';
These queries work fine, but to execute them the processor does a complete table scan twice. To avoid this and do it in one go, we can execute the following multi-insert query:
FROM page_views
INSERT INTO TABLE page_views_20140101 SELECT * WHERE date='20140101'
INSERT INTO TABLE page_views_20140102 SELECT * WHERE date='20140102';
Use of Bucketing
Bucketing is an optimization technique similar to partitioning, but given the concerns about over-partitioning, we can let the system segregate the data instead. Buckets distribute the data into a user-defined number of clusters by calculating a hash of the bucketing column. Bucketing is useful when it is difficult to partition on a column because that column has a huge variety of values, yet we still want to run queries on it.
One such example is the page_views table, where we want to run queries on user_id; but looking at the number of users, it would be difficult to create a separate partition for each and every user. So in this case we can create buckets on user_id.
The syntax for creating a bucketed table is as follows:
CREATE TABLE page_views (
  user_id INT,
  session_id BIGINT,
  url STRING
)
PARTITIONED BY (day INT)
CLUSTERED BY (user_id) INTO 100 BUCKETS;
Here the data is distributed into 100 buckets depending on the hash value of user_id. So when someone queries for a certain user_id, the processor first calculates the hash value of the user_id in the query and looks only in the corresponding bucket.
Let's understand this in more detail,
Say we have the following set of data to be loaded into the page_views table:
(user_id, session_id, url)
(1, 1111, http://example.com/page/1)
(2, 354, http://example.com/page/1)
(22, 76764, http://example.com/page/1)
(21, 74747, http://example.com/page/1)
(63, 64646, http://example.com/page/1)
Suppose the hash algorithm is modulus 10 on user_id. Then while loading the data, the processor will calculate the hash of each user_id and send the record to the bucket matching that hash.
e.g. for the first record, the hash for user_id=1 would be 1 (1 modulus 10 = 1)
For user_id=22 the hash would be 2 (22 modulus 10 = 2)
For user_id=63 the hash would be 3 (63 modulus 10 = 3)
So all records get classified into 10 buckets, named 0 to 9.
Now when we get a query on user_id=89, the processor goes only to bucket 9, and this way we save time by not scanning all the data.
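The walkthrough above can be sketched in a few lines of Python (a simplified stand-in for Hive's real bucketing, assuming the modulus-10 hash from the example):

```python
# Simplified bucketing: assign each record to a bucket by user_id % 10,
# mimicking the modulus-10 hash used in the example above.
records = [
    (1, 1111, "http://example.com/page/1"),
    (2, 354, "http://example.com/page/1"),
    (22, 76764, "http://example.com/page/1"),
    (21, 74747, "http://example.com/page/1"),
    (63, 64646, "http://example.com/page/1"),
]

NUM_BUCKETS = 10
buckets = {b: [] for b in range(NUM_BUCKETS)}
for user_id, session_id, url in records:
    buckets[user_id % NUM_BUCKETS].append((user_id, session_id, url))

# A query on user_id=22 only needs to scan bucket 2, not all records.
target = 22
candidate_rows = buckets[target % NUM_BUCKETS]
matches = [r for r in candidate_rows if r[0] == target]
print(matches)  # the single record with user_id=22
```

Hive's actual hash depends on the column type and bucket count, but the pruning idea is the same: the query touches one bucket instead of the whole table.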
Use Compression Wherever Possible
Compressing data on HDFS makes the data smaller to query, which ultimately helps in reducing query time. Even though compression and decompression consume a lot of CPU power, they do not affect MapReduce jobs much, as those jobs are generally I/O bound. So it is highly recommended to use compression whenever and wherever possible.
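As a sketch, output compression can be enabled with session settings like these (the choice of Snappy is an assumption; any codec available on the cluster works):

```sql
-- Compress the final output of queries; Snappy trades a little CPU
-- for much less I/O, which suits I/O-bound MapReduce jobs.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Also compress intermediate map output between job stages.
SET hive.exec.compress.intermediate=true;
```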