Using the spatial framework for hadoop with data stored in ORC files #85

Open
dvonck opened this issue Jun 23, 2015 · 5 comments
@dvonck

dvonck commented Jun 23, 2015

Good Afternoon,

The ORC format allows for the efficient storage and retrieval of big data files. For more details see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC.

We have installed a Hadoop Cluster based on the Hortonworks Data Platform 2.2.6.0-2800.

When we work with CSV files in Hive, we do not have any problems. When we use the ORC file format, we get the following problem.

ORC Problem

[hive@srv-hc10 ~]$ hive

hive> add jar esri-geometry-api-1.2.1.jar spatial-sdk-hive-1.0.3-SNAPSHOT.jar spatial-sdk-json-1.0.3-SNAPSHOT.jar;
Added [esri-geometry-api-1.2.1.jar, spatial-sdk-hive-1.0.3-SNAPSHOT.jar, spatial-sdk-json-1.0.3-SNAPSHOT.jar] to class path
Added resources: [esri-geometry-api-1.2.1.jar, spatial-sdk-hive-1.0.3-SNAPSHOT.jar, spatial-sdk-json-1.0.3-SNAPSHOT.jar]
hive> create temporary function ST_Bin as 'com.esri.hadoop.hive.ST_Bin';
OK
Time taken: 0.636 seconds
hive> create temporary function ST_BinEnvelope as 'com.esri.hadoop.hive.ST_BinEnvelope';
OK
Time taken: 0.014 seconds

hive> describe formatted xxxxxxx.events_orc;
OK

col_name data_type comment

vehicle_id int
ignition smallint
event_ts bigint
event_description string
longitude double
latitude double
altitude string
speed smallint
bearing smallint
linear_g double
lateral_g double
trip_no int

Detailed Table Information

Database: xxxxxxx
Owner: root
CreateTime: Thu Jun 18 22:41:42 SAST 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://srv-hcm01.esri-southafrica.com:8020/apps/hive/warehouse/xxxxxxx.db/events_orc
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE false
auto.purge true
comment xxxxxxx analysis table
last_modified_by root
last_modified_time 1434727038
numFiles 62
numRows -1
orc.compress SNAPPY
rawDataSize -1
totalSize 1954173667
transient_lastDdlTime 1434727038

Storage Information

SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 62
Bucket Columns: [vehicle_id]
Sort Columns: [Order(col:event_ts, order:1)]
Storage Desc Params:
serialization.format 1
Time taken: 1.135 seconds, Fetched: 47 row(s)
hive> select ST_Bin(0.001, ST_Point(longitude, latitude)) as binvalue, count(*) as freq
> from xxxxxxx.events_orc
> where longitude is not null and latitude is not null and vehicle_id = 63962497
> group by ST_Bin(0.001, ST_Point(longitude, latitude));
Query ID = hive_20150623124949_0461acf6-46d8-41e4-99e1-6b62836abf6a
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.

Status: Running (Executing on YARN cluster with App id application_1434395264469_0091)


    VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

       Map 1      FAILED     68          0        0       68     153      67
   Reducer 2      KILLED      8          0        0        8       0       8

VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 23.79 s

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1434395264469_0091_1_00, diagnostics=[Task failed, taskId=task_1434395264469_0091_1_00_000011, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.allocateColumnVector(VectorizedRowBatchCtx.java:643)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:606)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:339)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:109)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:49)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:58)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:33)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.createValue(TezGroupedSplitsInputFormat.java:141)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:150)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:609)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:588)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:140)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:361)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:134)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
... 13 more
], TaskAttempt 1 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.allocateColumnVector(VectorizedRowBatchCtx.java:643)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:606)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:339)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:109)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:49)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:58)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:33)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.createValue(TezGroupedSplitsInputFormat.java:141)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:150)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:609)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:588)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:140)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:361)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:134)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
... 13 more
], TaskAttempt 2 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.allocateColumnVector(VectorizedRowBatchCtx.java:643)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:606)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:339)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:109)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:49)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:58)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:33)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.createValue(TezGroupedSplitsInputFormat.java:141)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:150)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:609)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:588)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:140)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:361)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:134)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
... 13 more
], TaskAttempt 3 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:172)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Error: Cannot allocate vector column for None
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.allocateColumnVector(VectorizedRowBatchCtx.java:643)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addScratchColumnsToBatch(VectorizedRowBatchCtx.java:606)
at org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.createVectorizedRowBatch(VectorizedRowBatchCtx.java:339)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:109)
at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcInputFormat$VectorizedOrcRecordReader.createValue(VectorizedOrcInputFormat.java:49)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:58)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:33)
at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.createValue(TezGroupedSplitsInputFormat.java:141)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:150)
at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:80)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:609)
at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:588)
at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:140)
at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:361)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:134)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:162)
... 13 more
]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1434395264469_0091_1_00 [Map 1] killed/failed due to:null]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1434395264469_0091_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1434395264469_0091_1_01 [Reducer 2] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
hive>

Could you please investigate whether this is viable?

Regards

Derck

@climbage
Member

Interesting. We will try to reproduce this. In the meantime, can you disable vectorization to try and get around the error?

set hive.vectorized.execution.enabled = false;

This may affect the performance of the queries.

@dvonck
Author

dvonck commented Jun 23, 2015

Hi Michael

Running set hive.vectorized.execution.enabled = false; and set hive.default.fileformat=TextFile; made the queries work. From a look at the source code, it seems that ORC does not know how to work with the spatial types in columns.

See the code at http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.1.0/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatchCtx.java#VectorizedRowBatchCtx.allocateColumnVector%28java.lang.String%2Cint%29

private ColumnVector allocateColumnVector(String type, int defaultSize) {
  if (type.equalsIgnoreCase("double")) {
    return new DoubleColumnVector(defaultSize);
  } else if (VectorizationContext.isStringFamily(type)) {
    return new BytesColumnVector(defaultSize);
  } else if (VectorizationContext.decimalTypePattern.matcher(type).matches()){
    int [] precisionScale = getScalePrecisionFromDecimalType(type);
    return new DecimalColumnVector(defaultSize, precisionScale[0], precisionScale[1]);
  } else if (type.equalsIgnoreCase("long") ||
      type.equalsIgnoreCase("date") ||
      type.equalsIgnoreCase("timestamp")) {
    return new LongColumnVector(defaultSize);
  } else {
    throw new Error("Cannot allocate vector column for " + type);
  }
}
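To make the fall-through easier to see, here is a minimal, self-contained sketch of the same type dispatch. It is only a simplified model, not Hive's code: the class name, the VectorKind enum, and the string-family and decimal checks are stand-ins I made up for illustration. The point is that the type name reported in the stack trace, None, matches none of the branches and so reaches the final else.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Simplified stand-in for the quoted dispatch; none of these names are Hive's.
final class VectorTypeDispatchSketch {

    // Hypothetical labels for the column vector classes the quoted method returns.
    enum VectorKind { DOUBLE, BYTES, DECIMAL, LONG }

    // Assumed approximation of VectorizationContext.decimalTypePattern.
    private static final Pattern DECIMAL =
            Pattern.compile("decimal.*", Pattern.CASE_INSENSITIVE);

    // Assumed approximation of VectorizationContext.isStringFamily.
    static boolean isStringFamily(String type) {
        String t = type.toLowerCase(Locale.ROOT);
        return t.equals("string") || t.startsWith("char") || t.startsWith("varchar");
    }

    // Mirrors the branch order of the quoted method.
    static VectorKind kindFor(String type) {
        if (type.equalsIgnoreCase("double")) {
            return VectorKind.DOUBLE;
        } else if (isStringFamily(type)) {
            return VectorKind.BYTES;
        } else if (DECIMAL.matcher(type).matches()) {
            return VectorKind.DECIMAL;
        } else if (type.equalsIgnoreCase("long")
                || type.equalsIgnoreCase("date")
                || type.equalsIgnoreCase("timestamp")) {
            return VectorKind.LONG;
        } else {
            // Same message format as the error in the stack trace.
            throw new Error("Cannot allocate vector column for " + type);
        }
    }

    public static void main(String[] args) {
        String[] types = {"double", "string", "decimal(10,2)", "long", "None"};
        for (String type : types) {
            try {
                System.out.println(type + " -> " + kindFor(type));
            } catch (Error e) {
                System.out.println(type + " -> " + e.getMessage());
            }
        }
    }
}
```

Under these assumptions, any type name outside the four families ends in the Error, which is consistent with the failure only showing up when vectorized execution is on.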

Thank you very much for your help.

You can close this issue.

Regards

Derck


@krishnat2

Hey, we are able to work with spatial data in ORC files.

I ran into the same problem as you. After some research I figured out that the Tez engine uses vectorization, which does not support the binary data type. When we compute ST_Point or ST_Polygon, the result is binary data. So just disabling vectorization for this step solves the problem.
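One way to read "disabling vectorization for this step" is to switch the setting off just before the spatial query and back on afterwards, within the same Hive session. The sketch below only builds that ordered list of statements; the class and method names are hypothetical, and it assumes vectorization was enabled beforehand (otherwise the last statement should be dropped). How the statements get submitted (Hive CLI, Beeline, JDBC) is left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: wraps one query so it runs with vectorization turned off.
final class VectorizationToggleSketch {

    // Setting suggested earlier in this thread.
    static final String DISABLE = "set hive.vectorized.execution.enabled = false";
    // Assumes vectorization was on before the query and should be restored.
    static final String ENABLE = "set hive.vectorized.execution.enabled = true";

    // Returns the statements to run, in order, on a single Hive session.
    static List<String> wrapWithoutVectorization(String query) {
        return Arrays.asList(DISABLE, query, ENABLE);
    }

    public static void main(String[] args) {
        String query =
                "select ST_Bin(0.001, ST_Point(longitude, latitude)) as binvalue, count(*) as freq "
              + "from xxxxxxx.events_orc "
              + "where longitude is not null and latitude is not null and vehicle_id = 63962497 "
              + "group by ST_Bin(0.001, ST_Point(longitude, latitude))";
        for (String statement : wrapWithoutVectorization(query)) {
            System.out.println(statement + ";");
        }
    }
}
```

Keeping the toggle around a single query limits the performance cost mentioned earlier to the spatial step itself.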

@ColeFerrier

ColeFerrier commented May 26, 2016

I don't see the method called out above on master:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedRowBatchCtx.java

Do we think this is still a problem on hive-master?

It looks like it was changed in this commit:

apache/hive@30f20e9

Then the code in

https://github.com/apache/hive/blame/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedBatchUtil.java

was at some point updated to include binary support, or at least it appears that way.
