This repository has been archived by the owner on Apr 25, 2020. It is now read-only.
Dmitry Vasilenko edited this page Sep 3, 2013 · 13 revisions

The XML SerDe allows the user to map the XML schema to Hive data types through the Hive Data Definition Language (DDL), according to the following rules.

CREATE [EXTERNAL] TABLE <table_name> (<column_specifications>)
ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"
WITH SERDEPROPERTIES (
["xml.processor.class"="<xml_processor_class_name>",]
"column.xpath.<column_name>"="<xpath_query>",
...
["xml.map.specification.<element_name>"="<map_specification>"
...
]
)
STORED AS
INPUTFORMAT "com.ibm.spss.hive.serde2.xml.XmlInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
[LOCATION "<data_location>"]
TBLPROPERTIES (
"xmlinput.start"="<start_tag>",
"xmlinput.end"="<end_tag>"
);

For example, the following XML...

F 1 1 2 2 0 1 1 1 4 0 2 2 18 1.003392 2.740608 0

...would be represented by the following Hive DDL.

CREATE TABLE xml_bank(customer_id STRING, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
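
Once the table exists, the map columns can be read with Hive's bracket syntax. The query below is a minimal sketch, not taken from the original page. It assumes the SerDe uses child element names as map keys, and that the demographics and financial elements contain children named gender and income. Both key names are hypothetical; substitute the element names that actually appear in your XML.

```sql
-- Sketch only: 'gender' and 'income' are assumed (hypothetical) child element names.
-- A key that is absent from a map returns NULL rather than raising an error.
SELECT customer_id,
       demographics['gender'] AS gender,
       financial['income']    AS income
FROM xml_bank;
```

Because both map columns are declared as map<string,string>, the values come back as strings. A numeric comparison would need an explicit cast, for example CAST(financial['income'] AS DOUBLE).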
