Getting Started with Hadoop, Hive, and Sqoop

28 August, 2012

    Everyone wants to work with big data. I work with SAS, and they recently released the SAS/ACCESS Interface to Hadoop. When I started setting up Hive, which is required to use Hadoop with SAS, I ran into some issues that I had a hard time finding solutions for. Since I plan to write posts on how to use Hadoop in SAS, I thought I would first post what I did to get Hive set up. Ultimately, it's not that difficult, but there are a few nuggets of information in this post you might find helpful if you run into issues.

     

    This is a short tutorial on installing Hadoop, configuring a single-node Hadoop setup, and using Sqoop to load data from an Oracle database into Hive.

    This tutorial was tested with:

    • Ubuntu 12.04 LTS x86_64
    • Java Version 1.6.0_31
    • hadoop-1.0.3
    • sqoop-1.4.2
    • hive-0.8.1

     

    Tech Review

    Hadoop – An Apache open-source project that enables applications to work with a distributed file system to handle large data sets. Hadoop uses a framework called MapReduce to process data in the HDFS (Hadoop Distributed File System).

     

    Sqoop – A tool used to transfer data between Hadoop and relational databases. Sqoop reads the column information from the database and generates Java classes that represent that data for you to use in Map/Reduce jobs.

     

    Hive – A data warehouse system for Hadoop. Hive uses a SQL-like language called HiveQL. Statements written in HiveQL are translated into MapReduce jobs.

     

    SAS Note: SAS/ACCESS supports access to Hadoop via Hive. For more information on using Hadoop in SAS, see:

    http://support.sas.com/documentation/cdl/en/achadoop/65227/PDF/default/achadoop.pdf

     

     

    Installation

    Prerequisites

    • GNU/Linux
    • Java 1.6.x (preferred)
    • SSH
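
    A quick way to confirm the Java and SSH prerequisites are in place (the install command below is just one option and assumes Ubuntu; adjust for your distribution):

    $ java -version

    $ ssh -V

    $ sudo apt-get install ssh rsync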

    Download the latest stable version of Apache Hadoop.

    http://mirrors.ibiblio.org/apache/hadoop/common/stable/

     

    Download the latest version of Sqoop for the version of Hadoop you downloaded.

    http://www.trieuvan.com/apache/sqoop/

     

    Download the latest stable version of Hive.

    http://mirror.metrocast.net/apache/hive/stable/

     

    Also, download the JDBC drivers for your database. You will need these later.

    Oracle: http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html

    MySQL: http://dev.mysql.com/downloads/connector/j/

     

     

    Hadoop Configuration

    Once you unpack Hadoop, edit conf/hadoop-env.sh and define JAVA_HOME.

     

    Example conf/hadoop-env.sh:

    # The java implementation to use.  Required.

    export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle/

     

    Set the HADOOP_HOME environment variable.

    Run:

    $ export HADOOP_HOME=/usr/local/hadoop
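
    To make HADOOP_HOME stick across sessions, you can append it to your shell profile (this assumes bash and that you unpacked Hadoop to /usr/local/hadoop, as above):

    $ echo 'export HADOOP_HOME=/usr/local/hadoop' >> ~/.bashrc

    $ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc

    $ source ~/.bashrc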

     

    Configure the following files:

     

    conf/core-site.xml:

     

    <configuration>

         <property>

             <name>fs.default.name</name>

             <value>hdfs://localhost:9000</value>

         </property>

    </configuration>

     

    This sets the default file system name.

     

    conf/hdfs-site.xml:

     

    <configuration>

         <property>

             <name>dfs.replication</name>

             <value>1</value>

         </property>

    </configuration>

     

    This sets the default block replication. Since we are only running Hadoop on a single node, we set this to 1. If it is not set, it defaults to 3, the standard Hadoop replication factor.

     

    conf/mapred-site.xml:

     

    <configuration>

         <property>

             <name>mapred.job.tracker</name>

             <value>localhost:9001</value>

         </property>

    </configuration>

     

    This sets the host and port on which the MapReduce JobTracker runs.

     

     

    Starting Hadoop

     

    Format the NameNode

    Run:

    $ $HADOOP_HOME/bin/hadoop namenode -format

     

    Example output:

    12/08/27 20:47:58 INFO namenode.NameNode: STARTUP_MSG:

    /************************************************************

    STARTUP_MSG: Starting NameNode

    STARTUP_MSG:   host = hadoop02.bnl-consulting.com/10.0.0.159

    STARTUP_MSG:   args = [-format]

    STARTUP_MSG:   version = 1.0.3

    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012

    ************************************************************/

    12/08/27 20:47:58 INFO util.GSet: VM type       = 64-bit

    12/08/27 20:47:58 INFO util.GSet: 2% max memory = 19.33375 MB

    12/08/27 20:47:58 INFO util.GSet: capacity      = 2^21 = 2097152 entries

    12/08/27 20:47:58 INFO util.GSet: recommended=2097152, actual=2097152

    12/08/27 20:47:59 INFO namenode.FSNamesystem: fsOwner=hdfs

    12/08/27 20:47:59 INFO namenode.FSNamesystem: supergroup=supergroup

    12/08/27 20:47:59 INFO namenode.FSNamesystem: isPermissionEnabled=true

    12/08/27 20:47:59 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

    12/08/27 20:47:59 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

    12/08/27 20:47:59 INFO namenode.NameNode: Caching file names occuring more than 10 times

    12/08/27 20:48:00 INFO common.Storage: Image file of size 110 saved in 0 seconds.

    12/08/27 20:48:01 INFO common.Storage: Storage directory /tmp/hadoop-hdfs/dfs/name has been successfully formatted.

    12/08/27 20:48:01 INFO namenode.NameNode: SHUTDOWN_MSG:

    /************************************************************

    SHUTDOWN_MSG: Shutting down NameNode at hadoop02.bnl-consulting.com/10.0.0.159

    ************************************************************/

     

    Start the Cluster

    Run:

    $ $HADOOP_HOME/bin/start-all.sh

     

    If you get any errors on this step, check that you have ssh configured properly.

    Check that you can ssh to localhost without a passphrase.

    $ ssh localhost

     

    If not, run the following to set up passphraseless SSH:

    $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

    $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
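
    Once start-all.sh finishes, you can check that the daemons came up with jps, which ships with the JDK. On a single-node setup you should see NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker (plus Jps itself).

    $ jps

    You can also browse the NameNode web UI at http://localhost:50070/ and the JobTracker UI at http://localhost:50030/ (the Hadoop 1.x default ports).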

     

    Installing Hive

    Once you unpack Hive, set the HIVE_HOME environment variable.

    Run:

    $ export HIVE_HOME=/usr/local/hive

     

    Now that Hadoop is running and Hive is installed, you need to create the HDFS directories Hive uses for its warehouse and scratch space and set their permissions.

    Run:

    $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp

    $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse

    $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp

    $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
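
    As a quick sanity check, you can list the warehouse directory you just created (output will vary):

    $ $HADOOP_HOME/bin/hadoop fs -ls /user/hive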

     

    Test your Hive install.

    Run:

    $ $HIVE_HOME/bin/hive

    hive> show tables;

    OK

    Time taken: 6.374 seconds

     

    If you have any issues, check that you set the HDFS directory permissions properly. Also check the settings in $HIVE_HOME/conf/hive-site.xml.
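
    For reference, the warehouse location Hive uses is controlled by the hive.metastore.warehouse.dir property, whose default matches the directory created above. You only need an entry like the following in hive-site.xml if you want to override that default:

    <property>

        <name>hive.metastore.warehouse.dir</name>

        <value>/user/hive/warehouse</value>

    </property>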

     

    Using Sqoop

    Once you have unpacked Sqoop, move the JDBC jars you downloaded into the sqoop/lib/ directory.
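
    For example, assuming you unpacked Sqoop to /usr/local/sqoop and downloaded Oracle's ojdbc6.jar driver (your driver filename may differ):

    $ export SQOOP_HOME=/usr/local/sqoop

    $ cp ojdbc6.jar $SQOOP_HOME/lib/

    The import command later in this post refers to $SQOOP_HOME, so setting it here keeps the steps consistent.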

     

    Run:

    $ bin/sqoop list-databases --connect jdbc:oracle:thin:@HOSTNAME:PORT/SID --username USERNAME -P

     

    If you do not see a list of familiar databases you should check your connection string and your JDBC driver.

     

    Before you get too excited and try to import an entire database, please check the following requirements.

    • Each table MUST have a SINGLE-COLUMN primary key

    • You must intend to import all columns of each table

    • You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause
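
    If your tables do satisfy all of these requirements, a whole-database import is a single command; roughly (a sketch that mirrors the flags of the single-table import shown below):

    $ $SQOOP_HOME/bin/sqoop import-all-tables --connect jdbc:oracle:thin:@HOSTNAME:PORT/SID --username USERNAME -P --hive-import -m 1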

     

    To avoid issues with importing an entire database it may be best to import one table at a time.

     

    Before running the following statement, make sure that $HIVE_HOME/bin has been added to your PATH.
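
    One way to do that (assuming the bash shell):

    $ export PATH=$PATH:$HIVE_HOME/bin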

     

    To import a table run:

    $ $SQOOP_HOME/bin/sqoop import --connect jdbc:oracle:thin:@HOSTNAME:PORT/SID --username USERNAME -P --table TABLENAME --hive-import --verbose -m 1

     

    If you have issues running this statement, make sure that your USERNAME and TABLENAME are in uppercase. If you continue to have issues, make sure you are using the right port and check that you have the correct JDBC driver.

     

    Test your import with the following statement:

    $ $HIVE_HOME/bin/hive

    hive> show tables;

    TABLENAME

    Time taken: 6.374 seconds

     

    Try running a basic query on the table.

    hive> select * from TABLENAME limit 10;
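
    A query with an aggregate will also exercise the HiveQL-to-MapReduce translation mentioned earlier; it launches a MapReduce job, so expect it to take longer than the limit query:

    hive> select count(*) from TABLENAME;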

     

    Troubleshooting Sqoop

     

    Clear the output directory

    If a Sqoop import fails partway through, you will need to clear that table's output directory in HDFS before retrying.

    Example:

    ERROR security.UserGroupInformation: PriviledgedActionException as: USERNAME cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory TABLENAME already exists

    ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory TABLENAME already exists

     

    Execute the following command, replacing USERNAME and TABLENAME with your username and table name, to clear the output directory.

     

    $ $HADOOP_HOME/bin/hadoop dfs -rmr /user/USERNAME/TABLENAME

     

    Check Permissions

    A common issue is not having the correct permissions set. If you get a Java exception that says it cannot create or write to a file location, adjust the permissions on the directory named in the exception.
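
    For example, if the exception points to a directory in HDFS, something along these lines usually resolves it (the path is a placeholder; use the one reported in the exception):

    $ $HADOOP_HOME/bin/hadoop fs -chmod -R g+w /path/from/exception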

     

    Conclusion

    I hope this tutorial helped you get started using Hadoop, Hive, and Sqoop. My next tutorial will be on using Hadoop in SAS.