Hadoop Tutorials.CO.IN
Big Data - Hadoop - Hadoop Ecosystem - NoSQL - Spark

Data Units in Apache Hive

by Tanmay Deshpande



Hive is a data warehousing tool, so most of the data it handles is structured and typically comes from flat files or from tools such as Teradata or Informatica. To process such structured data, Hive organizes it into the following units, in order of increasing granularity:



Databases

A database is a namespace that keeps tables and other data units free of naming conflicts. In a typical production environment, a Hadoop/Hive cluster is shared among multiple teams, so it is important to segregate our work from that of other teams. For this reason, we should create our own database when working with Hive.

Tables

A table is an organized set of records that share the same schema. An example is a page_views table, where each row could consist of the following columns (schema):

timestamp - The time at which someone browsed the URL.

userid - A unique identifier for the user.

page_url - A string consisting of the host name, port, and page visited.

referer_url - The URL from which the user arrived at the current page.

IP - The IP address of the machine from which the user accessed the page.
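To make the idea of a shared schema concrete, here is a minimal Python sketch of a single page_views row. The column names follow the list above (with IP lowercased to `ip`); the Python types and the sample values are assumptions made for illustration, not something Hive prescribes.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class PageView:
    """One row of the page_views table; every row has the same columns."""
    timestamp: datetime   # when the URL was browsed
    userid: str           # unique identifier of the user (type assumed)
    page_url: str         # host name, port, and page visited
    referer_url: str      # URL the user arrived from
    ip: str               # IP address of the client machine


# Sample values are made up for illustration.
row = PageView(
    timestamp=datetime(2014, 6, 4, 10, 30),
    userid="u1001",
    page_url="http://example.com:80/home",
    referer_url="http://example.com:80/search",
    ip="192.0.2.10",
)

# The schema is simply the ordered list of column names.
schema = [f.name for f in fields(PageView)]
print(schema)
```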



Partitions

A partition is a logical segregation of data that groups records by the value of a certain attribute. Each table can have one or more partition keys, which determine how the data is stored. For example, we can partition the page_views table described above on week_start_date, which we can calculate from the timestamp column. When we tell Hive to partition the data on this key, it places records into a separate folder for each key value as they are inserted, so in this case the data is arranged by week. The benefit is that a query can be restricted to only the data it needs. For example, if we run a query with week_start_date='6/1/2014', Hive executes it only on the 6/1/2014 partition. This speeds up data analysis and improves performance.
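The weekly partitioning described above can be sketched in plain Python. This is an illustration of the idea rather than Hive's implementation: it assumes that weeks start on Sunday (6/1/2014 was a Sunday), that key values are written as ISO dates, and that folders follow the key=value naming pattern. The helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def week_start_date(ts: datetime) -> date:
    """Return the Sunday on or before ts (assumed week convention)."""
    d = ts.date()
    # weekday(): Monday=0 ... Sunday=6, so (weekday + 1) % 7 = days since Sunday.
    return d - timedelta(days=(d.weekday() + 1) % 7)


def partition_dir(table: str, ts: datetime) -> str:
    """Build a key=value style folder name for the record's partition."""
    return f"{table}/week_start_date={week_start_date(ts).isoformat()}"


def partition_rows(table: str, rows: list) -> dict:
    """Group rows by partition folder, as happens when records are inserted."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_dir(table, row["timestamp"])].append(row)
    return dict(parts)


def scan_partition(parts: dict, table: str, wanted: date) -> list:
    """Read only the folder for the requested week; other folders are skipped."""
    return parts.get(f"{table}/week_start_date={wanted.isoformat()}", [])


# Made-up sample rows spanning three different weeks.
rows = [
    {"userid": "u1", "timestamp": datetime(2014, 5, 31, 23, 59)},
    {"userid": "u2", "timestamp": datetime(2014, 6, 1, 0, 0)},
    {"userid": "u3", "timestamp": datetime(2014, 6, 4, 10, 30)},
    {"userid": "u4", "timestamp": datetime(2014, 6, 8, 9, 0)},
]
parts = partition_rows("page_views", rows)
hits = scan_partition(parts, "page_views", date(2014, 6, 1))
print(sorted(parts))
print([r["userid"] for r in hits])
```

A query restricted to the week of 6/1/2014 touches only one of the three folders, which is where the performance gain comes from.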



Buckets (or Clusters)

Bucketing is similar to partitioning, but here a hash function applied to a column decides which bucket (or cluster) each record goes into. Partitioning and bucketing are optional; they serve only to improve performance and make the data more manageable.
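The hash-based assignment can be sketched as follows. This only illustrates the principle of hashing a column value modulo the number of buckets. It uses CRC32 as a stand-in hash and does not reproduce Hive's own hash function; bucketing on userid and the count of four buckets are assumptions chosen for the example.

```python
import zlib


def bucket_for(userid: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the column value modulo the bucket count.

    CRC32 is used here because it is stable across runs; it is a stand-in,
    not the hash function Hive uses.
    """
    return zlib.crc32(userid.encode("utf-8")) % num_buckets


def assign_buckets(userids: list, num_buckets: int) -> list:
    """Distribute values into num_buckets lists according to their hash."""
    buckets = [[] for _ in range(num_buckets)]
    for userid in userids:
        buckets[bucket_for(userid, num_buckets)].append(userid)
    return buckets


# Made-up user ids; the same id always lands in the same bucket.
userids = ["u1", "u2", "u3", "u4", "u5", "u1"]
buckets = assign_buckets(userids, num_buckets=4)
for i, members in enumerate(buckets):
    print(i, members)
```

Unlike a partition key, the bucket number is not a value a user picks: it is derived from the hash, so the number of buckets stays fixed no matter how many distinct userid values appear.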






