This tutorial provides a quick introduction to using the CarbonData Hive integration module (integration/hive).
- Create a sample.csv file using the following commands. The CSV file is required for loading data into CarbonData.

```
cd carbondata
cat > sample.csv << EOF
id,name,scale,country,salary
1,yuhai,1.77,china,33000.1
2,runlin,1.70,china,33000.2
EOF
```
- Copy the data to HDFS

```
$HADOOP_HOME/bin/hadoop fs -put sample.csv <hdfs store path>/sample.csv
```
- Add the following property to $SPARK_CONF_DIR/conf/hive-site.xml

```
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.carbondata.hive.CarbonHiveMetastoreListener</value>
</property>
```
- Start the Spark shell by running the following command in the Spark directory

```
./bin/spark-shell --jars <carbondata assembly jar path, carbon hive jar path>
```
- In the Spark shell, create a CarbonData table and load the sample data

```
import org.apache.spark.sql.SparkSession

val newSpark = SparkSession.builder().config(sc.getConf).enableHiveSupport.config("spark.sql.extensions","org.apache.spark.sql.CarbonExtensions").getOrCreate()
newSpark.sql("drop table if exists hive_carbon")
newSpark.sql("create table hive_carbon(id int, name string, scale decimal, country string, salary double) STORED AS carbondata")
newSpark.sql("LOAD DATA INPATH '<hdfs store path>/sample.csv' INTO TABLE hive_carbon")
newSpark.sql("SELECT * FROM hive_carbon").show()
```
- Configure the Hive classpath: copy the CarbonData jars and the required Spark jars into an auxiliary directory and export HIVE_AUX_JARS_PATH

```
mkdir hive/auxlibs/
cp carbondata/assembly/target/scala-2.11/carbondata_2.11*.jar hive/auxlibs/
cp carbondata/integration/hive/target/carbondata-hive-*.jar hive/auxlibs/
cp $SPARK_HOME/jars/spark-catalyst*.jar hive/auxlibs/
cp $SPARK_HOME/jars/scala*.jar hive/auxlibs/
export HIVE_AUX_JARS_PATH=hive/auxlibs/
```
- Fix the snappy library issue (the example paths below are macOS-style; adjust them for your environment)

```
copy snappy-java-xxx.jar from "./<SPARK_HOME>/jars/" to "./Library/Java/Extensions"
export HADOOP_OPTS="-Dorg.xerial.snappy.lib.path=/Library/Java/Extensions -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib -Dorg.xerial.snappy.tempdir=/Users/apple/DEMO/tmp"
```
- Copy the carbon jars to the following paths

```
hive/lib/ (for hive server)
yarn/lib/ (for MapReduce)
```
- Start Hive beeline to query the table

```
$HIVE_HOME/bin/beeline
```
- Write data from Hive into the CarbonData format

```
create table hive_carbon(id int, name string, scale decimal, country string, salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
insert into hive_carbon select * from parquetTable;
```

Note: Only non-transactional tables are supported when created through Hive. This means that the standard carbon folder structure is not followed and all files are written in a flat folder structure.
- Read the carbon table through Hive. This is the integration of CarbonData with Hive; set the following properties before querying

```
set hive.mapred.supports.subdirectories=true;
set mapreduce.input.fileinputformat.input.dir.recursive=true;
```

These properties help Hive recursively traverse the directories and read the carbon folder structure.
- Query the table

```
select * from hive_carbon;
select count(*) from hive_carbon;
select * from hive_carbon order by id;
```
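If the Spark session from the earlier steps is still open, the same table can be cross-checked from Spark to confirm that both engines see the same rows. A minimal sketch, assuming the `newSpark` session created above:

```
// Cross-check from the Spark session created earlier (assumes `newSpark` is still in scope).
newSpark.sql("select count(*) from hive_carbon").show()
newSpark.sql("select * from hive_carbon order by id").show()
```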
- Partition tables are currently not supported.
- Currently, because carbon is implemented as a non-native Hive table, the user has to add the `storage_handler` information in tblproperties if the table has to be accessed from Hive. Once the tblproperties have been updated, the user will not be able to perform certain operations such as alter, update/delete, etc., in both Spark and Hive. The command is given below.

```
alter table <tableName> set tblproperties('storage_handler'='org.apache.carbondata.hive.CarbonStorageHandler');
```
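For a table created from the Spark session, the same property can also be set from Spark SQL. A minimal sketch, assuming the `hive_carbon` table and the `newSpark` session from the earlier steps:

```
// Assumption: the table was created via the Spark session above; setting storage_handler
// in tblproperties so that the table can also be accessed from Hive.
newSpark.sql("alter table hive_carbon set tblproperties('storage_handler'='org.apache.carbondata.hive.CarbonStorageHandler')")
```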