Getting Started with Hadoop, Hive, and Sqoop

28 August, 2012

Everyone wants to work with big data. I work with SAS, which recently released the SAS/ACCESS Interface to Hadoop. When I started setting up Hive, which is required to use Hadoop with SAS, I ran into some issues whose solutions were hard to find. Since I plan to write posts on how to use Hadoop in SAS, I thought I would first post what I did to get Hive set up. Ultimately, setting up Hive is not that difficult, but there are a few nuggets of information in this post you might find helpful if you run into issues.


This is a short tutorial on installing Hadoop, configuring a single-node Hadoop setup, and using Sqoop to load data from Oracle into a Hive database.

This tutorial was tested with:

  • Ubuntu 12.04 LTS x86_64
  • Java Version 1.6.0_31
  • hadoop-1.0.3
  • sqoop-1.4.2
  • hive-0.8.1


Tech Review

Hadoop – An Apache open-source project that enables applications to work with a distributed file system to handle large data sets. Hadoop uses a framework called MapReduce to process data in the HDFS (Hadoop Distributed File System).
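As a toy illustration of the MapReduce model (plain Python, not Hadoop code; all names here are my own), word counting maps each word to a (word, 1) pair, groups the pairs by key, and reduces each group with a sum:

```python
# Toy illustration of the MapReduce model -- NOT Hadoop itself.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values with a sum.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

Hadoop runs the same three phases, but distributes the map and reduce tasks across the cluster and stores the data in HDFS.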


Sqoop – A tool used to transfer data between Hadoop and relational databases. Sqoop reads the column information from the database and generates Java classes that represent that data for you to use in Map/Reduce jobs.
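To see why a single-column numeric key matters for parallel imports, here is a hypothetical Python sketch of how an importer might split a primary-key range across mappers. This is illustrative only, not Sqoop's actual algorithm (Sqoop's real logic lives in its generated Java code):

```python
# Hypothetical sketch: divide a numeric primary-key range [lo, hi]
# into num_mappers contiguous, non-overlapping chunks, one per mapper.
def split_key_range(lo, hi, num_mappers):
    span = hi - lo + 1
    bounds = []
    start = lo
    for i in range(num_mappers):
        # Spread any remainder across the first few chunks.
        size = span // num_mappers + (1 if i < span % num_mappers else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

print(split_key_range(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each mapper then issues a query restricted to its own chunk, which is why a table without a single-column key cannot be split this way automatically.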


Hive – A data warehouse system for Hadoop. Hive uses a SQL-like language called HiveQL. Statements written in HiveQL are translated into MapReduce jobs.
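For example, a HiveQL query reads like ordinary SQL, but Hive compiles it into one or more MapReduce jobs behind the scenes (the table and column names here are illustrative, not from this tutorial):

```sql
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;
```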


SAS Note: SAS/ACCESS supports access to Hadoop via Hive. For more information on using Hadoop in SAS, see the SAS/ACCESS documentation.


Prerequisites

  • GNU/Linux
  • Java 1.6.x (preferred)
  • SSH

Download the latest stable version of Apache Hadoop.


Download the latest version of Sqoop for the version of Hadoop you downloaded.


Download the latest stable version of Hive.


Also, download the JDBC drivers for your database. You will need these later.





Hadoop Configuration

Once you unpack Hadoop, edit the environment script in the conf/ directory and define JAVA_HOME.


Example:

# The java implementation to use.  Required.

export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle/


Set the HADOOP_HOME environment variable.


$ export HADOOP_HOME=/usr/local/hadoop


Configure the following files:











This sets the default file system name.
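The property being described is fs.default.name, which on Hadoop 1.x lives in conf/core-site.xml. A typical single-node fragment looks like this (the host and port are the conventional defaults, shown here as an assumption, not taken from this post):

```xml
<!-- conf/core-site.xml: fs.default.name points clients at the NameNode. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```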











This sets the default block replication. Since we are running Hadoop on a single node, we set it to 1. If it is not set, it defaults to 3, the standard Hadoop replication value.
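The property being described is dfs.replication, set in conf/hdfs-site.xml on Hadoop 1.x. A single-node fragment would look like:

```xml
<!-- conf/hdfs-site.xml: replicate each block once on a single-node setup. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```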











This sets the host and port that the MapReduce JobTracker runs on.
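The property being described is mapred.job.tracker, set in conf/mapred-site.xml on Hadoop 1.x. A typical single-node fragment (the port shown is the conventional default, included as an assumption):

```xml
<!-- conf/mapred-site.xml: where MapReduce clients find the JobTracker. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```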



Starting Hadoop


Format the NameNode


$ $HADOOP_HOME/bin/hadoop namenode -format


Example output:

12/08/27 20:47:58 INFO namenode.NameNode: STARTUP_MSG:


STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host =

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 1.0.3

STARTUP_MSG:   build = -r 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012


12/08/27 20:47:58 INFO util.GSet: VM type       = 64-bit

12/08/27 20:47:58 INFO util.GSet: 2% max memory = 19.33375 MB

12/08/27 20:47:58 INFO util.GSet: capacity      = 2^21 = 2097152 entries

12/08/27 20:47:58 INFO util.GSet: recommended=2097152, actual=2097152

12/08/27 20:47:59 INFO namenode.FSNamesystem: fsOwner=hdfs

12/08/27 20:47:59 INFO namenode.FSNamesystem: supergroup=supergroup

12/08/27 20:47:59 INFO namenode.FSNamesystem: isPermissionEnabled=true

12/08/27 20:47:59 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

12/08/27 20:47:59 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

12/08/27 20:47:59 INFO namenode.NameNode: Caching file names occuring more than 10 times

12/08/27 20:48:00 INFO common.Storage: Image file of size 110 saved in 0 seconds.

12/08/27 20:48:01 INFO common.Storage: Storage directory /tmp/hadoop-hdfs/dfs/name has been successfully formatted.

12/08/27 20:48:01 INFO namenode.NameNode: SHUTDOWN_MSG:


SHUTDOWN_MSG: Shutting down NameNode at



Start the Cluster


Start the Hadoop daemons (for Hadoop 1.x, the bundled script launches all of the HDFS and MapReduce daemons at once):

$ $HADOOP_HOME/bin/start-all.sh


If you get any errors on this step, check that you have SSH configured properly.

Check that you can ssh to the localhost.

$ ssh localhost


If not, generate a passwordless key and authorize it:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


Installing Hive

Once you unpack Hive set the HIVE_HOME environment variable.


$ export HIVE_HOME=/usr/local/hive


Now that Hadoop and Hive are both installed and running you need to create directories for the Hive metastore and set their permissions.


$ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp

$ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse


Test your Hive install.


$ $HIVE_HOME/bin/hive

hive> show tables;


Time taken: 6.374 seconds


If you have any issues check that you set the hdfs directory permissions properly. Also check the settings in $HIVE_HOME/conf/hive-site.xml.
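If the warehouse location is misconfigured, Hive fails with permission or path errors. A minimal hive-site.xml fragment looks like the following; the value shown is Hive's usual default, included here as an assumption about your layout:

```xml
<!-- $HIVE_HOME/conf/hive-site.xml: where Hive stores managed tables in HDFS. -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
```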


Using Sqoop

Once you have unpacked Sqoop move the JDBC jars you downloaded into the sqoop/lib/ directory.



$ bin/sqoop list-databases --connect jdbc:oracle:thin:@HOSTNAME:PORT/SID --username USERNAME -P


If you do not see a list of familiar databases you should check your connection string and your JDBC driver.


Before you get too excited and try to import entire databases please check the following requirements.

  • Each table MUST have a SINGLE-COLUMN primary key

  • You must intend to import all columns of each table

  • You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause


To avoid issues with importing an entire database it may be best to import one table at a time.


Before running the following statement make sure that HIVE_HOME is added to PATH.


To import a table run:

$ $SQOOP_HOME/bin/sqoop import --connect jdbc:oracle:thin:@HOSTNAME:PORT/SID --username USERNAME -P --table TABLENAME --hive-import --verbose -m 1


If you have issues running this statement make sure that your USERNAME and TABLENAME are in uppercase. If you continue to have issues make sure you are using the right port and check that you have the correct JDBC driver.


Test your import with the following statement:

$ $HIVE_HOME/bin/hive

hive> show tables;


Time taken: 6.374 seconds


Try running a basic request on the table.

hive>select * from TABLENAME limit 10;


Troubleshooting Using Sqoop


Clear the output directory

If you have any issues running the Sqoop import, you will need to clear that table's output location in HDFS.


ERROR security.UserGroupInformation: PriviledgedActionException as: USERNAME cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory TABLENAME already exists

ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory TABLENAME already exists


Execute the following code, replacing the USERNAME and TABLENAME with your username and table name, to clear the output directory.


$ $HADOOP_HOME/bin/hadoop dfs -rmr /user/USERNAME/TABLENAME


Check Permissions

A common issue is not having the correct permissions set. If you get a Java exception saying it cannot create or write to a file location, adjust the permissions on the folder named in the exception.



I hope this tutorial helped you get started using Hadoop, Hive, and Sqoop. My next tutorial will be on using Hadoop in SAS.