Vivek Mishra's Blog

Hive/HBase integration


Assuming you are aware of basic Hive and HBase concepts, my POC around Hive/HBase integration covers:

1) Real-time analytics on HBase.
2) An indexing mechanism.
3) Loading data from HDFS into HBase using Hive logical tables.
4) Bidirectional CRUD operations from Hive<->HBase on the same dataset.

Please refer to http://wiki.apache.org/hadoop/Hive/HBaseIntegration.

Codebase: I took the Hive 0.6.0 tag and applied the patch committed on trunk for HIVE-1264. HBase version: 0.20.6. To build the codebase, please use the "ant tar" task in $HIVE_SRC.

Note:

  1.  With the Cloudera distribution of HBase 0.89.x I was facing integration issues, so I moved to HBase 0.20.6, which worked well.
  2.  If you get a heap issue on starting Hive, change HADOOP_HEAPSIZE in $HIVE_SRC/bin/ext/util/execHiveCmd.sh.
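For the heap note above, the change is a single line in the launcher script. A minimal sketch of what it looks like; the 1024 MB value below is only an illustrative assumption, so size it to your machine:

```shell
# Fragment of $HIVE_SRC/bin/ext/util/execHiveCmd.sh -- the 1024 value is
# an assumption for illustration only; pick a size that suits your machine.
HADOOP_HEAPSIZE=1024
```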

Start ZooKeeper:
sudo /usr/lib/zookeeper/bin/zkServer.sh start

  • Start the HBase master and shell:
    sudo $HBASE_HOME/bin/hbase master start
    sudo $HBASE_HOME/bin/hbase shell
    From the HBase shell, issue "status" to check the status of the running master node. If the result is "MasterNotRunningException: null", you need to troubleshoot your HBase configuration.
     
Start Hive shell:

sudo /home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/bin/hive \
  --auxpath /home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/lib/hive_hbase-handler.jar,/home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/lib/hbase-0.20.6.jar,/home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/lib/zookeeper-3.2.2.jar \
  -hiveconf hbase.master=127.0.1.1:60000

Working Example:

Create a new HBase table which is to be managed by Hive:
CREATE TABLE hive_hbasetable_k(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek");
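In the mapping string above, ":key" binds to the HBase row key and each following entry pairs the next Hive column with a family:qualifier. As a hedged sketch (the table and column names below are invented for illustration and are not part of the original example), a table mapping two columns in the same family would look like:

```sql
-- Hypothetical variant: two Hive columns mapped into the cf1 family.
CREATE TABLE hive_hbasetable_2(key int, val1 string, val2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val1,cf1:val2")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek2");
```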

Create a logical table pokes in Hive:
CREATE TABLE pokes (foo INT, bar STRING);

Load data into pokes (note that LOCAL INPATH reads from the local filesystem, not HDFS):
LOAD DATA LOCAL INPATH '/home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Load data into the HBase table using Hive:
INSERT OVERWRITE TABLE hive_hbasetable_k SELECT * FROM pokes WHERE foo=98;

Issuing

select * from hive_hbasetable_k;
Result:
OK
98 val_98

Let's scan HBase to validate that the data was loaded, by issuing:

hbase(main):015:0> scan "hivehbasek"
ROW COLUMN+CELL
98 column=cf1:val, timestamp=1290502385735, value=val_98
1 row(s) in 0.1410 seconds

Before we try to load data from HBase and query it from Hive, there are two points I would like to share:

  1.  Upon creating an HBase table managed by Hive, you need to specify the column family and column name (cf1 is the column family and val is its name). So when putting data from HBase, make sure to specify the column name, else the value will not be included when you query it via Hive.
  2.  hive_hbasetable_k holds its key as an "int". So when putting data, please make sure to insert only integer values as row keys, else the key will come back as NULL when querying via Hive (e.g. put 'hivehbasek','r1:key','cf1:val','99_val').

Load data into HBase from the HBase shell:

hbase(main):012:0> put 'hivehbasek','99','cf1:val','99_val'
hbase(main):012:0> put 'hivehbasek','r1:key','cf1:val','99_val'
hbase(main):015:0> scan "hivehbasek"
ROW COLUMN+CELL
98 column=cf1:val, timestamp=1290502385735, value=val_98
99 column=cf1:val, timestamp=1290502652876, value=99_val
r1:key column=cf1:val, timestamp=1290502601128, value=99_val
3 row(s) in 0.1410 seconds

Now upon querying the same from Hive:
hive> select * from hive_hbasetable_k;
OK
98 val_98
99 99_val
NULL 99_val

Example 2: Create a Hive table on an existing HBase table and query it via Hive.


hbase(main):023:0> create 'hbasetohive', 'colFamily'
hbase(main):023:0> put 'hbasetohive', '1s', 'colFamily:val','1strowval'

hbase(main):023:0> scan 'hbasetohive'
ROW COLUMN+CELL
1s column=colFamily:val, timestamp=1290503237527, value=1strowval
1 row(s) in 0.0430 seconds

hive> CREATE EXTERNAL TABLE hbase_hivetable_k(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "colFamily:val")
TBLPROPERTIES("hbase.table.name" = "hbasetohive");

hive> select * from hbase_hivetable_k;
OK
1s 1strowval
Time taken: 0.515 seconds
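A side note on external-table semantics (my understanding of Hive behaviour, not something exercised in the run above): because hbase_hivetable_k was declared EXTERNAL, dropping it removes only the Hive metadata, so the HBase table survives:

```sql
-- Dropping the EXTERNAL table deletes Hive metadata only; the underlying
-- HBase table 'hbasetohive' is left intact and can still be scanned.
DROP TABLE hbase_hivetable_k;
```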

 

—————————————————————————————————————————————————-

HBase 0.93-SNAPSHOT and Hive 0.7.1 Integration:

Today I spent some time on this integration and found that it is now a bit different, so I thought I'd share my way of testing it.

Steps:

1) Build HBase (I used mvn clean install -Dmaven.test.skip=true).

2) Copy hbase/conf/hbase-site.xml into the generated HBase snapshot jar (i.e. target/HBase-0.93.0-SNAPSHOT).

3) Build Hive using the "ant jar" and "ant binary" tasks.

4) Copy the generated HBase jar to hive/build/dist/lib (removing the existing HBase jars).

5) Follow the steps given at the top of this post to start the HBase master.

6) Issue a command like:

/home/impadmin/source/hive/build/dist/bin/hive -hiveconf hive.root.logger=INFO,console,hbase.master=localhost:60000,hbase.zookeeper.quorum=localhost

Change these to your own HBase master and ZooKeeper settings; it should connect you to the HBase master.

7) Follow the examples given above to verify.

Happy programming :)


5 comments on "Hive/HBase integration"

  1. Shash
    June 22, 2012

Hi, in the integration steps you mentioned "build HBase"; does that mean the standard CDH distribution won't work for it?

  2. mevivs
    June 22, 2012

Hi,
That was the case with the Cloudera distribution of HBase 0.89.X; not sure about the latest Cloudera dist.

  3. Nishu Tayal
    July 16, 2013

While running this command:

"INSERT OVERWRITE TABLE hive_hbasetable_k SELECT * FROM pokes WHERE foo=98;"

I am getting the following error:

Exception in thread "Thread-40" java.lang.RuntimeException: Error while reading from task log url
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:240)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:227)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:92)
    at java.lang.Thread.run(Thread.java:722)
    Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://CTSINGMRGTO.cts.com:50060/tasklog?taskid=attempt_201307151221_0232_m_000000_2&start=-8193
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1612)
    at java.net.URL.openStream(URL.java:1035)
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:192)
    … 3 more
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask.

  4. mevivs
    July 16, 2013

Verify that the Hadoop processes are running properly.

  5. Nishu Tayal
    July 17, 2013

    All hadoop processes are running properly.


Information

This entry was posted on November 24, 2010 by in Hadoop, Hive/HBase, TechStuff.