
Apache Sqoop - Advanced Features

by Tanmay Deshpande


Introduction to Apache Sqoop

Apache Sqoop is a tool which allows users to import data from relational databases into HDFS and export data from HDFS back to relational databases. In another article we learnt about Apache Sqoop fundamentals; now it's time to understand some of its advanced features.

Importing Data in Compressed Format

Sometimes we need to keep imported data on HDFS in a compressed state in order to reduce the overall disk utilization. As Sqoop is based on MapReduce execution, it inherits all of Hadoop's compression features. It allows us to save imported data in various compressed formats, e.g. GZIP or BZip2. We can execute the following query to use the default gzip compression:

sqoop import \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--compress

This will store all imported files with the .gz extension. To use another compression codec, we have to mention it at the time of execution:

sqoop import \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--compress \
--compression-codec org.apache.hadoop.io.compress.BZip2Codec

This will store all files with the .bz2 extension.
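To verify that the files really landed in compressed form, you can list the import target directory. A minimal check, assuming the default target directory (a subdirectory named after the table under your HDFS home):

hadoop fs -ls student
# part files should carry the codec's extension, e.g. part-m-00000.bz2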

Performance Optimization by Controlling Parallelism



We can specify the degree of parallelism for importing data from the source database system. Increasing parallelism here means increasing the number of mappers from the default of 4 to a higher number. For many database systems and Hadoop clusters, increasing the value to 8 or 16 has given a significant performance improvement. But before doing so we have to consider several other things, such as the number of map slots on the Hadoop cluster or the spool space of the given database user. It is recommended to keep the number of map tasks lower than your cluster's map slots; if that value exceeds the available map slots, tasks start executing serially, which results in degraded performance.



Let's try to understand how Sqoop works in parallel on a given set of data. Consider a student table which consists of 1000 records and has roll_no as its primary key, with values from 1 to 1000. In this case, Sqoop will automatically find the primary key for the given table and distribute an equal set of records to each mapper. If we go with 4 mappers here, the first mapper will work on records (1-250), the second on (251-500), the third on (501-750) and the fourth on (751-1000).
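Under the hood, Sqoop first finds the minimum and maximum values of the split column (here the primary key roll_no) and then divides that range evenly among the mappers. A rough sketch of the split for the example above, shown as comments on the import command (the exact boundary query is generated by the JDBC connector):

sqoop import \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--num-mappers 4
# boundary query: SELECT MIN(roll_no), MAX(roll_no) FROM student
# mapper 1 imports roll_no 1-250, mapper 2 imports 251-500,
# mapper 3 imports 501-750, mapper 4 imports 751-1000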


To set the number of mappers, we can execute either of the following commands:

sqoop import \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--num-mappers 8

Or

sqoop import \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
-m 8
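If the table has no primary key, or the primary key values are not evenly distributed, Sqoop needs to be told which column to split on. A sketch using the --split-by option (the column roll_no is taken from the example above):

sqoop import \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--split-by roll_no \
--num-mappers 8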

Atomic Export

We have learned from basic database concepts about atomicity, which means either do the complete job or do nothing. Similarly, if you are exporting data using Sqoop to a table that is very important from the application's point of view and you want to make sure that Sqoop exports all of the data present on HDFS, you can use the --staging-table option. Here, we mention a staging table into which Sqoop exports all the data first; only if the entire export is successful does it write that data to the actual table.



The syntax for the same is given below:

sqoop export \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--export-dir /user/hive/warehouse/student \
--staging-table staging_student

This ensures the atomicity of the data export. If any of the parallel jobs initiated by Sqoop fails, Sqoop will not transfer any data to the target table.
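Note that the staging table must have the same structure as the target table and should be empty before the export starts; the --clear-staging-table option can be used to truncate it automatically before the export runs. A sketch:

sqoop export \
--connect jdbc:mysql://hostname/database_name \
--username user1 \
--password pwd \
--table student \
--export-dir /user/hive/warehouse/student \
--staging-table staging_student \
--clear-staging-table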




