Assuming you are aware of Hive and HBase basic concepts, my POC around Hive/HBase integration is : 1) Provide real time analytics on HBase.
2) Indexing mechanism.
3) Load Data from HDFS to HBase using Hive logical tables.
4) Bidirectional CRUD operation from Hive<->HBase on same dataset.
Please refer to http://wiki.apache.org/hadoop/Hive/HBaseIntegration. Codebase: i took hive 0.6.0 tag version and applied patch commited on trunk for Hive-1264. HBase version: 0.20.6 To build code base please use “ant tar” task for proper build for $HIVE_SRC. Note:
Starting Zookeeper:sudo /usr/lib/zookeeper/bin/zKServer.sh
sudo $HBASE_HOME/bin/hbase master start sudo bin/hbase shell from HBase shell issue “status” to status of master node running. If result is coming as “MasterNotRunningException:null” then you need to troubleshoot your HBase configuration.sudo /home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/bin/hive --auxpath /home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/lib/hive_hbase-handler.jar,/home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/lib/hbase-0.20.6.jar,/home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/lib/zookeeper-3.2.2.jar -hiveconf hbase.master=127.0.1.1:60000
Working Example:
create a new HBase table which is to be managed by Hive
CREATE TABLE hive_hbasetable_k(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek");
Create a logical table pokes in Hive
CREATE TABLE pokes (foo INT, bar STRING);
Insert data in pokes from HDFS.
LOAD DATA LOCAL INPATH '/home/impetus/Hadoop/Hadoop-0.21/hive/build/dist/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Load data into HBase table using Hive.
INSERT OVERWRITE TABLE hive_hbasetable_k SELECT * FROM pokes WHERE foo=98;
Issuing
select * from hive_hbasetable_k;
Result:
OK
98 val_98
Let’s scan HBase to validate data is loaded or not by issuing:
hbase(main):015:0> scan "hivehbasek"
ROW COLUMN+CELL
98 column=cf1:val, timestamp=1290502385735, value=val_98
1 row(s) in 0.1410 seconds
Before we try to load data from HBase and query it from Hive. Two points i would like to share:
Load data from HBase
hbase(main):012:0> put 'hivehbasek','99','cf1:val','99_val'
hbase(main):012:0>put 'hivehbasek','r1:key','cf1:val','99_val'
hbase(main):015:0> scan "hivehbasek"
ROW COLUMN+CELL
98 column=cf1:val, timestamp=1290502385735, value=val_98
99 column=cf1:val, timestamp=1290502652876, value=99_val
r1:key column=cf1:val, timestamp=1290502601128, value=99_val
3 row(s) in 0.1410 seconds
Now upon querying the same from Hive:
hive> select * from hive_hbasetable_k;
OK
98 val_98
99 99_val
NULL 99_val
Example 2: Create Hive table on existing HBase Table and querying it via Hive.
hbase(main):023:0> create 'hbasetohive', 'colFamily'
hbase(main):023:0> put 'hbasetohive', '1s', 'colFamily:val','1strowval'hbase(main):023:0> scan 'hbasetohive'
ROW COLUMN+CELL
1s column=colFamily:val, timestamp=1290503237527, value=1strowval
1 row(s) in 0.0430 seconds
hive> CREATE EXTERNAL TABLE hbase_hivetable_k(key string, value string)
STORED BY ‘org.apache.hadoop.hive.hbase.HBaseStorageHandler’
WITH SERDEPROPERTIES (“hbase.columns.mapping” = “colFamily:val”)
TBLPROPERTIES(“hbase.table.name” = “hbasetohive”);
hive> select * from hbase_hivetable_k;
OK
1s 1strowval
Time taken: 0.515 seconds
—————————————————————————————————————————————————-
HBase 0.93-SNAPSHOT and Hive 0.7.1 Integration:
Today i spent some time for this integration and found out that it is now bit different. So thought to share my way to test this integration.
Steps:
1) build HBase (used mvn clean install -Dmaven.test.skip=true)
2) Copied hbase/conf/hbase-site.xml to generated HBase snapshot jar(i.e. target/HBase-0.93.0-SNAPSHOT)
3) build Hive using ” ant jar” and ” ant binary” task execution.
4) Copy generated Hbase jar to Hive/build/dist/lib(remove existing Hbase jars)
5) Follow steps given on top of post to start hbase master.
6) give a command like /home/impadmin/source/hive/build/dist/bin/hive -hiveconf hive.root.logger=INFO,console,hbase.master=localhost:60000,hbase.zookeeper.quorum=localhost
Change to your hbase master and zookeeper settings. it should connect to you to Hbase master.
7) Follow above given examples to verify .
Happy programming
Hi,,,,In integration steps you mentioned “build HBase”, so does that mean standard CDH distribution won’t work for it?
Hi,
This was the case “cloudera distribution of HBase 0.89.X “, Not sure with latest cloudera dist.