Hadoop Tutorials.CO.IN
Big Data - Hadoop - Hadoop Ecosystem - NoSQL - Spark

Introduction to Apache Hive

by Tanmay Deshpande

We live in the data era. Data is everywhere, from our cell phones to high-end enterprise servers, from social networking sites to search engines, and from text messages to video calls. As the volume of data grows, so does the need to process it in a cost- and time-effective manner.

The Hadoop ecosystem emerged to meet that need, offering both time and cost effectiveness. Gradually, more and more organizations started using Hadoop and MapReduce programs. On one side, the need for Hadoop-like systems was increasing; on the other, people with those niche skills were scarce. To use Hadoop, companies had to have strong Java coders who could write complex MapReduce code, which was very difficult. Even companies that had Java developers who could write MapReduce programs soon felt that their dependency on those programmers was increasing day by day. Data analysts depended on them even for small result sets, which resulted in slow turnaround.

People recognized the need and decided to build something on top of Hadoop that would be accessible to a wider audience. This is where Hive comes in. Hive provides a SQL-like interface, known as Hive Query Language or HiveQL, for users to extract data from a Hadoop system.

SQL knowledge is widespread, and anyone with a decent knowledge of it can use Hive effectively. Hive translates the query into MapReduce jobs and runs them on the Hadoop cluster. Hive is best suited for data warehousing applications where data is structured, static and formatted. Hive is not a complete database. Design considerations of Hadoop and HDFS impose some constraints on what Hive can do. Hive does not provide row-level updates and inserts, which is the biggest disadvantage of using it. But Hive is not meant for OLTP applications; it is meant for data warehousing applications. As most Hive queries turn into MapReduce jobs, they have higher latency due to start-up overhead. Because of this, queries that would finish in milliseconds on traditional databases take longer on Hive, even on small data sets.

Hive is certainly not an OLTP (Online Transaction Processing) tool. It is closer to OLAP (Online Analytical Processing), but its high latency conflicts with the word "Online".

So if Hive is neither OLTP nor OLAP, what is it used for? Hive is best suited for data warehousing applications where data is stored and mined, and reports are generated from the processed results. As most data warehousing applications are based on relational database models, Hive bridges the gap between these applications and Hadoop.

However, like most SQL interfaces, HiveQL does not fully conform to the ANSI SQL standard. It differs in various ways.

Hive Is

Hive is a data warehousing tool built on Hadoop. Hadoop provides massive scale-out on distributed infrastructure with a high degree of fault tolerance for data storage and processing. Hadoop uses the MapReduce programming model to process huge amounts of data at minimal cost, as it does not require high-end machines. Hive converts most of its queries into MapReduce jobs that run on the Hadoop cluster. Hive is designed for easy and effective data aggregation, ad hoc querying and analysis of huge volumes of data.
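To make the query-to-MapReduce idea concrete, here is a minimal Python sketch of how an aggregate query could be evaluated in map, shuffle and reduce phases. It is an illustration only and not the code Hive actually generates. The table name weblogs, its status column and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical HiveQL being illustrated (table and column are assumptions):
#   SELECT status, COUNT(*) FROM weblogs GROUP BY status;

# Hypothetical sample rows: (url, status)
ROWS = [
    ("/index.html", 200),
    ("/about.html", 200),
    ("/missing", 404),
    ("/index.html", 200),
    ("/admin", 403),
    ("/gone", 404),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for _url, status in rows:
        yield status, 1


def shuffle_phase(pairs):
    """Collect all values emitted for the same key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_status(rows):
    """Run the three phases in order, as a single-process stand-in for a job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for status, count in sorted(count_by_status(ROWS).items()):
        print(status, count)
```

On a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them. The user writes only the one-line query, which is the convenience Hive adds.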

Hive Is Not

Even though Hive offers a SQL dialect, it does not offer SQL-like latency, as it ultimately runs MapReduce programs underneath. The MapReduce framework is built for batch processing jobs and has high latency; even the fastest Hive query can take several minutes on a relatively small data set of a few megabytes. We cannot simply compare Hive's performance with traditional SQL systems like Oracle, MySQL or SQL Server, as those systems are meant to do one thing and Hive is meant to do another. Hive aims to provide acceptable (but not optimal) latency for interactive querying over small data sets for sample queries.
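A back-of-the-envelope model helps show why start-up overhead dominates on small data. In the Python sketch below, total time is a fixed start-up cost plus scan time proportional to data size. Every number in it is a made-up assumption chosen only for illustration; none is a measurement of Hive, Hadoop or any database.

```python
def estimated_seconds(data_mb, startup_s, throughput_mb_per_s):
    """Fixed start-up cost plus the time needed to scan the data."""
    return startup_s + data_mb / throughput_mb_per_s


# Hypothetical figures, not benchmarks.
BATCH_STARTUP_S = 30.0     # assumed job scheduling and launch overhead
BATCH_THROUGHPUT = 1000.0  # assumed MB/s when spread over many machines
LOCAL_STARTUP_S = 0.01     # assumed overhead of a single-machine database
LOCAL_THROUGHPUT = 100.0   # assumed MB/s on a single machine


def compare(data_mb):
    """Return (batch_seconds, local_seconds) for a data set of data_mb MB."""
    batch = estimated_seconds(data_mb, BATCH_STARTUP_S, BATCH_THROUGHPUT)
    local = estimated_seconds(data_mb, LOCAL_STARTUP_S, LOCAL_THROUGHPUT)
    return batch, local


if __name__ == "__main__":
    for size_mb in (10, 1_000_000):
        batch, local = compare(size_mb)
        print(f"{size_mb} MB: batch ~{batch:.2f}s, local ~{local:.2f}s")
```

Under these assumed numbers the batch system loses badly on 10 MB, because the fixed overhead is nearly the whole cost, and wins on a terabyte, because the overhead becomes negligible next to the scan time.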

As noted earlier, Hive is not an OLTP (Online Transaction Processing) application and is not meant to be connected to systems that need interactive processing. It is meant for batch jobs over huge, immutable data sets. Good examples of such data are web logs, application logs and call data records (CDR).

