Dmitry Vasilenko edited this page Sep 3, 2013 · 13 revisions

The XML SerDe allows the user to map the XML schema to Hive data types through the Hive Data Definition Language (DDL), according to the following rules.

CREATE [EXTERNAL] TABLE <table_name> (<column_specifications>)
ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"
WITH SERDEPROPERTIES (
["xml.processor.class"="<xml_processor_class_name>",]
"column.xpath.<column_name>"="<xpath_query>",
...
["xml.map.specification.<element_name>"="<map_specification>"
...
]
)
STORED AS
INPUTFORMAT "com.ibm.spss.hive.serde2.xml.XmlInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
[LOCATION "<data_location>"]
TBLPROPERTIES (
"xmlinput.start"="<start_tag ",
"xmlinput.end"="<end_tag>"
);

For example, the following XML...

<records>
<record customer_id="0000-JTALA">
<demographics>
<gender>F</gender>
<agecat>1</agecat>
<edcat>1</edcat>
<jobcat>2</jobcat>
<empcat>2</empcat>
<retire>0</retire>
<jobsat>1</jobsat>
<marital>1</marital>
<spousedcat>1</spousedcat>
<residecat>4</residecat>
<homeown>0</homeown>
<hometype>2</hometype>
<addresscat>2</addresscat>
</demographics>
<financial>
<income>18</income>
<creddebt>1.003392</creddebt>
<othdebt>2.740608</othdebt>
<default>0</default>
</financial>
</record>
</records>

...would be represented by the following Hive DDL.

CREATE TABLE xml_bank(customer_id STRING, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT "com.ibm.spss.hive.serde2.xml.XmlInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
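
Once the table exists, the map columns can be read with Hive's standard map subscript syntax. The query below is a minimal sketch, assuming the SerDe keys each map by child element name (for example gender or income); that keying is an assumption here and is not stated above. Values come back as strings because the columns are declared map<string,string>, so numeric use requires a cast.

```sql
-- Sketch: read individual XML child elements out of the map columns.
-- Assumes map keys are the child element names (e.g. "gender", "income").
SELECT customer_id,
       demographics["gender"]                AS gender,
       CAST(financial["income"] AS INT)      AS income,
       CAST(financial["creddebt"] AS DOUBLE) AS creddebt
FROM xml_bank
WHERE financial["default"] = "0";
```

If the keys turn out to differ from the element names, running SELECT demographics FROM xml_bank LIMIT 1 shows the actual map contents.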
