#For more details, see the wiki page (Chinese user manual): https://github.com/mayanhui/hbase-secondary-index/wiki
###################################################
###################################################
##0.Environment
- hadoop: 1.0.4
- hbase: 0.94.0
- zookeeper: 3.4.3
- hive: 0.9.0
- thrift: 0.9.0
##1.Many ways to build index
###1.1 MapReduce
Use HBase's MapReduce integration to build an index table for the main table. The main steps are:
(1) scan the input table with TableMapper<ImmutableBytesWritable, Writable>
(2) get the rowkey and the name and value of the specified column
(3) create a Put instance with value=rowkey and rowkey=columnName + "_" + columnValue
(4) use IdentityTableReducer to write the data into the index table
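The steps above amount to one transformation per indexed cell: the index rowkey is built from the column name and value, and the main-table rowkey is stored as the value. Here is a minimal Python sketch of that transformation only, not the project's Java mapper. The function name `index_entries` and the dict-based row are hypothetical, and using the full `family:qualifier` string as the column name is an assumption, since the text does not say whether the family is included.

```python
def index_entries(rowkey, row, columns):
    """Yield (index_rowkey, index_value) pairs for one main-table row.

    rowkey  -- rowkey of the main table
    row     -- dict mapping 'family:qualifier' to the cell value
    columns -- columns to index, e.g. ['cf1:mid']
    """
    for column in columns:
        if column not in row:
            continue  # nothing to index when the row lacks this column
        # Step (3): index rowkey = columnName + "_" + columnValue, value = rowkey
        yield column + "_" + row[column], rowkey


main_row = {"cf1:mid": "m42", "cf1:age": "30"}
entries = list(index_entries("user001", main_row, ["cf1:mid", "cf1:age"]))
print(entries)
```

Looking up a value in the index table then returns the rowkey of the matching main-table row, which is what makes the secondary lookup possible.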
Supported index types:
- single-column index
- several single-column indexes built together
- combined-column index
- JSON column index (single-field or combined-field)
- rowkey-only index
Commands to build an index:
- build single column index

      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid -s 20130101 -e 20130120 -v 1

- build multi single-column index together

      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg -s 20130101 -e 20130120 -v 3

- build combined-column index

      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg -si false
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg -si false -s 20130101 -e 20130120 -v 1

- build json column index. single-field, combined-field index

      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category -s 20130101 -e 20130120 -v 1
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category -si false
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category -si false -s 20130101 -e 20130120 -v 1

- build rowkey only index

      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey -r uid:1,mid:2,isrowkey:1
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey:cf1:content -r uid:1,mid:2,isrowkey:1
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey:cf1:content -r uid:1,mid:2,isrowkey:1 -s 20130101 -e 20130120
      hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey:cf1:content -r uid:1,mid:2,isrowkey:1 -s 20130101 -e 20130120 -v 1
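
The `-s` and `-e` options in the commands above take dates in `yyyyMMdd` form. One plausible way such a range could be turned into a timestamp window is sketched below in Python. This README does not specify how the tool interprets the dates, so the UTC time zone, the inclusive end date, and the function name `date_range_to_millis` are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def date_range_to_millis(sdate="19700101", edate=None):
    """Convert yyyyMMdd dates to a [start, end) window in epoch milliseconds."""
    start = datetime.strptime(sdate, "%Y%m%d").replace(tzinfo=timezone.utc)
    if edate is None:
        # The README gives "today" as the default end date.
        end_day = datetime.now(timezone.utc).replace(
            hour=0, minute=0, second=0, microsecond=0)
    else:
        end_day = datetime.strptime(edate, "%Y%m%d").replace(tzinfo=timezone.utc)
    # Assumption: the end date is inclusive, so the window closes at the next midnight.
    end = end_day + timedelta(days=1)
    if end <= start:
        raise ValueError("end date must not be earlier than start date")
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


start_ms, end_ms = date_range_to_millis("20130101", "20130120")
print(start_ms, end_ms)
```

HBase cell timestamps are in epoch milliseconds, which is why the sketch returns milliseconds rather than seconds.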
###1.2 ITHBase
Set the following properties in $HBASE_HOME/conf/hbase-site.xml:
- hbase.hlog.splitter.impl: org.apache.hadoop.hbase.regionserver.transactional.THLogSplitter
- hbase.regionserver.class: org.apache.hadoop.hbase.ipc.IndexedRegionInterface
- hbase.regionserver.impl: org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
- hbase.hregion.impl: org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion
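
In hbase-site.xml each setting is written as a `<property>` element with `<name>` and `<value>` children. The Python sketch below renders the four properties listed above in that layout, so the output can be pasted inside the file's `<configuration>` element. The helper name `render_properties` is hypothetical and not part of this project.

```python
from xml.sax.saxutils import escape

# Property names and values exactly as listed in this section.
ITHBASE_PROPERTIES = {
    "hbase.hlog.splitter.impl":
        "org.apache.hadoop.hbase.regionserver.transactional.THLogSplitter",
    "hbase.regionserver.class":
        "org.apache.hadoop.hbase.ipc.IndexedRegionInterface",
    "hbase.regionserver.impl":
        "org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer",
    "hbase.hregion.impl":
        "org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion",
}


def render_properties(properties):
    """Render name/value pairs as <property> elements for hbase-site.xml."""
    blocks = []
    for name, value in properties.items():
        blocks.append(
            "  <property>\n"
            f"    <name>{escape(name)}</name>\n"
            f"    <value>{escape(value)}</value>\n"
            "  </property>"
        )
    return "\n".join(blocks)


snippet = render_properties(ITHBASE_PROPERTIES)
print(snippet)
```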
###1.3 IHBase
The implementation of this method comes from https://github.com/ykulbak/ihbase. However, the code cannot be used as-is because many classes are missing. This method is not recommended because it is invasive.
###1.4 Coprocessor
A demo is implemented. This method has been available since hbase-0.92.0 and is not yet mature. Its characteristics are:
- Each index requires its own dedicated code, so reusability is poor.
- The table must be disabled before running alter table, which is unfriendly to online services.
- It is still better than the other online index-building methods, which are invasive.
#####################
#####################
##2 MapReduce Usage
###2.1 Build from source code
Download the source code, then build the jar with Maven. From the project directory, run:

    mvn install

Note: Maven 2.2.1 or later is required.
###2.2 Use the jar
The jar file hbase-secondary-index-0.1.jar is in the root directory of the project and can be used directly.
###2.3 Build index
Follow the example in buildindex.sh in the 'src/main/resources' directory, for example:

    hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i user_behavior_attribute_noregistered -o user_behavior_attribute_noregistered_index -c bhvr:vvmid -s 20130101 -e 20130120 -v 3

usage: Build-Secondary-Index -c family:qualifier [-d] [-e ] -i [-j ] -o [-r ] [-s ] [-si ] [-v ]
- -c,--column family:qualifier: column to store row data into (must exist). For example: cf1:age,cf2:tag,cf2:msg, or rowkey, or rowkey,cf1:age. The last two forms are for building a 'rowkey' index.
- -d,--debug: switch on DEBUG log level.
- -e,--edate: the end date of the data to index (default is today), for example 20130120.
- -i,--input: the directory or file to read from (must exist).
- -j,--json: JSON fields to build the index from. At most 3 fields. This kind of data uses IndexJsonMapper.class.
- -o,--output: table to import into (must exist).
- -r,--rowkey: rowkey fields to build the index from. At most 2 fields. This kind of data uses IndexRowkeyMapper.class. The format is uid:1,msgid:2,isrowkey:1, where uid and msgid are field names, 1 and 2 are their positions in the rowkey (for example uid_msgid_ts), and isrowkey marks which field becomes the new rowkey. The separator in the rowkey is _ . A validate column can be used to build an incremental index; in that case add the column to the -c parameter, so that -c becomes 'rowkey,cf1:age'.
- -s,--sdate: the start date of the data to index (default is 19700101), for example 20130101.
- -si,--sindex: whether to build a single index. true means 'single index', false means 'combined index' (default is true). A combined index can cover at most 3 columns.
- -v,--versions: the number of versions of each cell to index (default is Integer.MAX_VALUE).
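
The `-r` format is the least obvious of the options, so here is a small Python sketch of how a spec such as `uid:1,msgid:2,isrowkey:1` could be read and applied to a rowkey like `uid_msgid_ts`. It is an illustration only, not the logic of IndexRowkeyMapper. The function names are hypothetical, and reading `isrowkey:1` as "the field at position 1 becomes the new rowkey" is an assumption drawn from the description above.

```python
def parse_rowkey_spec(spec):
    """Parse a spec such as 'uid:1,msgid:2,isrowkey:1'.

    Returns (fields, key_position): fields maps each 1-based position in the
    rowkey to a field name; key_position is the position named by 'isrowkey'.
    """
    fields, key_position = {}, None
    for item in spec.split(","):
        name, _, position = item.partition(":")
        if name == "isrowkey":
            key_position = int(position)
        else:
            fields[int(position)] = name
    if len(fields) > 2:
        raise ValueError("at most 2 rowkey fields are supported")
    if key_position not in fields:
        raise ValueError("isrowkey must point at one of the listed fields")
    return fields, key_position


def split_rowkey(rowkey, fields, key_position, separator="_"):
    """Return (new_rowkey, named_fields) extracted from a main-table rowkey."""
    parts = rowkey.split(separator)
    named = {name: parts[pos - 1] for pos, name in fields.items()}
    return named[fields[key_position]], named


fields, key_position = parse_rowkey_spec("uid:1,msgid:2,isrowkey:1")
new_rowkey, named = split_rowkey("u100_m200_1358000000", fields, key_position)
print(new_rowkey, named)
```

Because the separator is `_`, a field value that itself contains `_` would shift the positions; the sketch does not try to handle that case.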
##License
Released under the GPLv3 license. For full details, please see the LICENSE file included in this distribution.