This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[ORC] Encounter bitmap out of bound issue in evaluateFilter #557

Open
zhixingheyi-tian opened this issue Nov 9, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@zhixingheyi-tian
Collaborator

Describe the bug
While running TPC-DS integration testing, we encountered the following out-of-bound error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 9) (vsr532 executor 1): max_bitmap_index 1920799 must be <= maxSupportedValue 65535 in selection vector
        at org.apache.arrow.gandiva.evaluator.JniWrapper.evaluateFilter(Native Method)
        at org.apache.arrow.gandiva.evaluator.Filter.evaluate(Filter.java:179)
        at org.apache.arrow.gandiva.evaluator.Filter.evaluate(Filter.java:131)
        at com.intel.oap.expression.ColumnarConditionProjector$$anon$1.hasNext(ColumnarConditionProjector.scala:241)
        at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)
        at org.apache.spark.sql.execution.ColumnarBroadcastExchangeExec.$anonfun$relationFuture$2(ColumnarBroadcastExchangeExec.scala:107)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
@zhixingheyi-tian zhixingheyi-tian added the bug Something isn't working label Nov 9, 2021
@zhixingheyi-tian
Collaborator Author

zhixingheyi-tian commented Nov 9, 2021

By debugging, I have figured out that the cause is in Arrow's file_orc.cc:

Result<RecordBatchIterator> Execute() override {

  ...

  Result<std::shared_ptr<RecordBatch>> Next() {
    if (i_ == num_stripes_) {
      return nullptr;
    }
    std::shared_ptr<RecordBatch> batch;
    // TODO (https://issues.apache.org/jira/browse/ARROW-14153)
    // pass scan_options_->batch_size
    return reader_->ReadStripe(i_++, included_fields_);
  }

  ...

}

The ORC reader in the Arrow dataset API does not yet honor the ScanOptions batch_size option; it returns one RecordBatch per stripe.

So the returned RecordBatch may contain more than 65535 rows, which exceeds the maximum index a 16-bit selection vector can address in Gandiva's evaluateFilter.
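Until batch_size is honored upstream, one possible workaround on the caller side is to slice an oversized stripe-level batch into chunks that fit the selection vector's addressable range before filtering. The sketch below only computes the (offset, length) slice bounds; `SliceBounds` is a hypothetical helper, not part of Arrow or this repository.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// A 16-bit selection vector can address row indices up to 65535,
// matching the "maxSupportedValue 65535" in the error message.
constexpr int64_t kMaxRowsInt16 = 65535;

// Hypothetical helper: split a batch of num_rows rows into
// (offset, length) slices, each no longer than kMaxRowsInt16,
// so every slice can be filtered with an int16 selection vector.
std::vector<std::pair<int64_t, int64_t>> SliceBounds(int64_t num_rows) {
  std::vector<std::pair<int64_t, int64_t>> bounds;
  for (int64_t offset = 0; offset < num_rows; offset += kMaxRowsInt16) {
    bounds.emplace_back(offset, std::min(kMaxRowsInt16, num_rows - offset));
  }
  return bounds;
}
```

Each slice could then be passed to the filter as a separate batch (e.g. via RecordBatch::Slice), keeping every max_bitmap_index within the supported range.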

@zhixingheyi-tian
Collaborator Author

cc @zhouyuan @zhztheplayer

@zhouyuan
Collaborator

#556 may help
