This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-99]Update Performance Result for Decision Support Benchmark1&2 #106

Merged: 3 commits, Feb 19, 2021

Conversation

weiting-chen
Collaborator

What changes were proposed in this pull request?

Close #99

How was this patch tested?

This is a document update only.

@weiting-chen weiting-chen requested a review from zhouyuan February 9, 2021 08:23
@weiting-chen weiting-chen self-assigned this Feb 9, 2021
@weiting-chen weiting-chen added the documentation Improvements or additions to documentation label Feb 9, 2021
@weiting-chen weiting-chen changed the title Issue0099 [NSE-99]Update Performance Result for Decision Support Benchmark1&2 Feb 9, 2021

github-actions bot commented Feb 9, 2021


README.md Outdated
@@ -110,6 +110,26 @@ For initial microbenchmark performance, we add 10 fields up with spark, data siz

![Performance](./docs/image/performance.png)
Collaborator

can you please also remove the outdated perf data?

Collaborator Author

Done

@weiting-chen weiting-chen merged commit 79997bf into oap-project:master Feb 19, 2021
HongW2019 pushed a commit to HongW2019/gazelle_plugin that referenced this pull request Sep 2, 2021
This commit implements the Native SQL Engine for OAP. 

The key components are:
- Use Apache Arrow as the column vector format for intermediate data passed between Spark operators.
- Enable Apache Arrow native readers for Parquet and other formats.
- Leverage Apache Arrow Gandiva/Compute to evaluate columnar expressions with SIMD optimizations.

OAP Native SQL Engine has been verified with the TPC-H workload as of this commit. Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Rong Ma <rong.ma@intel.com>
Co-authored-by: Jiayi Chen <Jiayi.chen@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Rui Mo <rui.mo@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Binwei Yang <binwei.yang@intel.com>

======================
* ProjectList prepare check and type change

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Add new ReadWriteBench

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Add ColumnarHashAggregate support

Framework is in place and the code runs, but results are still incorrect.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Add an optimization to skip unnecessary project work

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Bug fix] fixed multiple cols aggregation failing issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* Update ApacheArrowInstallation.md

* use arrow version property to record the right version

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* integrate columnar shuffle operator and relative UT

* Update README.md

* Update ApacheArrowInstallation.md

* Update ApacheArrowInstallation.md

* Apply coding format onto current project

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* [C++]Fix cpp code format

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Add files via upload

* Add files via upload

* [C++]Refactoring current cpp codes and change return using vector<RecordBatch>

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Using google-code-style for c++ codes

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [JAVA]change java jniWrapper to return a ArrowRecordBatch array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [SCALA]Bug fixing: ColumnarShuffleExchangeExec didn't recursively pass child to next operator.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Remove Gandiva Protobuf in this project, added in arrow side

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [DOC] Fix installation guide after we remove gandiva_protobuf

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Add jni_common.h

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add splitArray function

Adds a function that splits one Array into multiple arrays by distinct key.
The code is done and runs with correct results; larger inputs will be tried next.

Reminder: currently only one array can be used as the splitter.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
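
The splitArray commit above can be pictured with plain standard-library containers. The following is a minimal sketch, not the project's kernel: `SplitByKey` is a hypothetical name, and the `std::vector` inputs are stand-ins for Arrow arrays. As in the commit note, a single key array drives the split.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper (not the project's splitArray kernel): splits `values`
// into one output vector per distinct entry of `keys`, preserving row order
// within each group.
std::map<int32_t, std::vector<int64_t>> SplitByKey(
    const std::vector<int32_t>& keys, const std::vector<int64_t>& values) {
  // Both columns must describe the same rows.
  assert(keys.size() == values.size());
  std::map<int32_t, std::vector<int64_t>> groups;
  for (std::size_t row = 0; row < keys.size(); ++row) {
    groups[keys[row]].push_back(values[row]);
  }
  return groups;
}
```

Calling `SplitByKey({1, 2, 1, 3}, {10, 20, 30, 40})` yields three groups, with the rows for key 1 kept in their original order.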

* [C++] Bug fix when only splitting one array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] refactoring codes to support a visitor chain

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* add CMakeLists.txt

* [C++] Change splitArray to only use one loop for all arrays

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Fix a kernel_ext bug and refine the code slightly

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions.

Originally, Arrow initialized a hash_table inside the DictEncode and ValueCounts functions,
so multiple arrays could not be processed against the same hash_table. With the change to the Arrow
code, we can now pass in a long-lived hash_table and get a unified index across all arrays.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
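
The effect of a long-lived hash table, as described in the commit above, can be sketched as follows. This is an illustration under assumptions: `SharedDictEncoder` is a hypothetical class built on `std::unordered_map`, and it does not use Arrow's DictEncode or its hash_table type. The point it shows is that when the table outlives a single call, equal values in different arrays receive the same index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical encoder: it owns the hash table, so the table outlives any
// single input array and indices stay consistent across calls.
class SharedDictEncoder {
 public:
  // Returns one index per input value. A value seen in an earlier call
  // keeps the index it was given then.
  std::vector<int32_t> Encode(const std::vector<std::string>& values) {
    std::vector<int32_t> indices;
    indices.reserve(values.size());
    for (const std::string& value : values) {
      auto it = table_.find(value);
      if (it == table_.end()) {
        // Guard against running out of int32 indices.
        assert(table_.size() <
               static_cast<std::size_t>(std::numeric_limits<int32_t>::max()));
        // The next free index is the current number of distinct values.
        it = table_.emplace(value, static_cast<int32_t>(table_.size())).first;
      }
      indices.push_back(it->second);
    }
    return indices;
  }

  std::size_t num_groups() const { return table_.size(); }

 private:
  std::unordered_map<std::string, int32_t> table_;
};
```

Encoding `{"a", "b", "a"}` and then `{"b", "c"}` with the same encoder maps `"b"` to index 1 in both arrays, which a per-call table could not guarantee.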

* [C++] Add new interface "finish"

This interface evaluates across multiple recordBatches and generates the output when finish is called.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
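
The evaluate-then-finish pattern from the commit above might look roughly like this. It is a sketch under assumptions: `GroupBySumEvaluator`, `Evaluate` and `Finish` are hypothetical names for illustration, and plain vectors stand in for record batches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical evaluator: each Evaluate() call folds one batch into running
// state, and nothing is emitted until Finish() is called.
class GroupBySumEvaluator {
 public:
  // Consumes one batch (a key column and a value column of equal length).
  void Evaluate(const std::vector<int32_t>& keys,
                const std::vector<int64_t>& values) {
    assert(keys.size() == values.size());
    for (std::size_t row = 0; row < keys.size(); ++row) {
      sums_[keys[row]] += values[row];
    }
  }

  // Returns the per-key sums over every batch seen so far.
  std::map<int32_t, int64_t> Finish() const { return sums_; }

 private:
  std::map<int32_t, int64_t> sums_;
};
```

Feeding two batches and then calling `Finish()` returns sums that span both batches, which is why a per-batch return value alone is not enough for aggregation.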

* [C++] add a new appendArrayToBatch function

With this function we can build a new recordBatch from multiple input recordBatches, which makes it possible to produce a final aggregate result over all of them.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuilder

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [c++] Only use dict when splitting array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update ApacheArrowInstallation.md

* [C++] support groupby aggregate in cpp level

1. added EncodeArray kernel
2. added a finish function mechanism
3. added appendToCache functions
4. splitArrayList uses indices instead of caching the whole list

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala/Java] Added support for groupBy aggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add jni support for finish function

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Fix on groupby aggregate feature

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Move merge multiple groupby batch into one implementation to CPP

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Continue optimizing groupby hash aggregation by using action

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix DataType issue for HashAggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Optimize GroupBy HashAggregate performance

1. Use the hash_table key column as the uniqueAction input, so uniqueAction does not need to recalculate it each time
2. Set the max group id at the beginning of row evaluation, so each action only needs to do the actual evaluation work.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Use AppendValues to build Array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change Makefile using O3, which significantly improves performance

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add Spark Metrics for HashAggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Using MinMax in SplitArrayListWith Action to get max group id

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add a new way to directly access data instead of using Array API
[C++] Using inline lambda instead of using function call in action

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Refactor arrow compute to simplify the code path for calling Eval()

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix bug, ColumnarBatch was not closed before

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Disable ColumnarShuffle, which appears to cause an OOM issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change to use cmake instead of makefile

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Add GoogleTests

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add a new unittest and macroed add_test

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change to use groupby_aggregate.h Group func

Rebased intel arrow to the latest arrow commit, reverted our changes to hash.h, and moved the group function to groupby_aggregate.h

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Sort support

Sorting requires two kernels: sortArraysToIndices produces an indices array, and the remaining arrays then use these sorted_indices to do a shuffle.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
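
The two-step sort described in the commit above (compute sorted indices from the key column, then reorder every other column by those indices) can be sketched as follows. `SortToIndices` and `TakeByIndices` are hypothetical names, and `std::vector` stands in for Arrow arrays; this is not the project's kernel code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Step 1 (hypothetical): returns the row order that sorts `keys` ascending.
// stable_sort keeps rows with equal keys in their original order.
std::vector<std::size_t> SortToIndices(const std::vector<int64_t>& keys) {
  std::vector<std::size_t> indices(keys.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::stable_sort(
      indices.begin(), indices.end(),
      [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
  return indices;
}

// Step 2 (hypothetical): reorders any column by the indices from step 1.
template <typename T>
std::vector<T> TakeByIndices(const std::vector<T>& column,
                             const std::vector<std::size_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (std::size_t idx : indices) {
    assert(idx < column.size());
    out.push_back(column[idx]);
  }
  return out;
}
```

Because the indices are computed once, the key comparison cost is paid a single time no matter how many payload columns are reordered afterwards.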

* [C++] Pass col type when make SplitArrayListWithAction Kernel

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* add AppendArrayKernel

This patch adds AppendArrayKernel support.

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] Bug fix, noticed we didn't use the java builder inside this project

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Fix bug when there is finish Func in expr

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Refine the way of extract hash aggregate input expression

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change unittest to use action_dono, so we can get multiple return_type

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Enable ColumnarSort

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add benchmark

Benchmark will read a parquet file from local and do evaluation upon this file

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] ShuffleArrayList performance optimization

Use builder directly instead of using array_builder_impl.h

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Add big scale test, batch size is 5176

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add Iterator<RecordBatch> as Finish Return

Currently we only support returning std::vector<RecordBatch>; this adds an iterator-based return to make it more extensible.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch>

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [BUG FIX] Fix uninitialized row_id bug

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [JAVA] adding missing BatchIterator file

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] allow to operate on Long and Double type

Both Gandiva and Arrow Compute support these two types now.

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding vhashjoin support

This patch adds vhashjoin support with the following major changes:
- Allow to set member set for kernels
- Adding Take&NTake kernels
- Spark columnar plugin for ShuffledHashJoinExec(turned off now)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] Bug fix in ColumnarAggregate when some columns will be trimmed

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Implement this feature with two methods:

1. Use utf8 to merge keys -> ConcatArrayKernel
2. Use gandiva to do hash + add -> HashAggrArrayKernel

For now we chose to use gandiva

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]Some fixing to support ColumnarAggregationWithTwoKeys

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add ColumnarBatchScan Support

With this, we can turn WSCG off when testing the columnar-based process

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a new ColumnarConditionProjector Operator

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP & Scala] Add descending and nulls-first support for ColumnarSort

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Add an alternative ColumnarJoin implementation (oap-project#71)

* [CPP]ShuffleArrayList kernel fix when null exists

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add Join Benchmark

We used the tpch lineitem and order tables to test join, which contain 800+ batches

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] adding a new method for ColumnarJoin

Add a new kernel called probeArrays, which takes multiple arrays as input one by one and then probes the primary key with another set of arrays.
Also refined shuffleArrayListKernel, so that by combining the two we can join batches from two tables together.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [JNI] Add jni support for using ResultIterator.Process

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Spark Columnar Support for ShuffledHashJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add New Unittest and BenchmarkTest for InnerJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Refactor current Join codes and support both right Join and InnerJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]ColumnarShuffledHashJoin Refine for InnerJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]ColumnarAggregate fix for Q4

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* fix cond projector without condition (oap-project#75)

should project with resultSchema

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix string support for columnar projection (oap-project#76)

* [Scala] fix string support for columnar projection

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix StringType convert

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] skip projector evaluate if filter has 0 row result

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix possible memory leak

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Add a new Action called CountLiterAction

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support CountLiteral

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Wip avg support (oap-project#79)

* [CPP] Enabled groupby avg, AvgByCount and SumCount kernel

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [JNI] Add a new interface called setReturnFields

This interface is used to set the result schema when some expressions return more than one field and the current gandiva expression cannot describe the schema.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] enable groupby avg

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Rewrite Unique Action and add String Support

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Remove Concat Kernel and Action, plus some code refinement

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] String fix

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* Update ApacheArrowInstallation.md

* [CPP] Multiple Key Groupby fix and optimization

Previously, groupby with multiple keys returned incorrect results; this commit fixes that.
Also, if the keys are all strings, they are concatenated with gandiva and hashed first, then passed to encodeArray.
This is a little faster than hashing and adding directly.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] SplitArray optimization

Move the input array from lambda capture to a class member, which improves performance a lot.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support Aggregation with projection inside case (oap-project#86)

With this fix, we are able to run unmodified TPCH Q1

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] adding support for starts_with & ends_with (oap-project#78)

* [Scala] adding support for starts_with & ends_with

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] adding support for like

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix string like support

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] support substring

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Support String in ColumnarJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] LeftSemi Join support

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Continue fix aggregate issue for Q3

Now Q3 is runnable

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Memory leak issue fixing

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP & Scala] Support multiple key join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add a new interface to get holder current size

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Refine current ConditionProjector codes

1. Use an iterator instead of map in ConditionProjector, so empty columnarBatches can be skipped in the output
2. Fix several bugs and make the input schema for condition and project clearer

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a return column size in Columnar AggregateExpression

Some aggregates, such as avg, take one input column and return two columns in the partial phase, then take two input columns and return one in the final phase. This is also a fix for Q1

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
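
The one-column-in, two-columns-out shape of a partial avg, and its reverse in the final phase, can be sketched as follows. The struct name `SumCount` echoes the SumCount and AvgByCount kernels mentioned in this log, but the functions here are hypothetical illustrations on plain vectors, not the project's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Partial result for avg: two fields are needed, because an average alone
// cannot be merged correctly with another average.
struct SumCount {
  double sum;
  int64_t count;
};

// Partial phase (hypothetical): one input column -> (sum, count).
SumCount PartialSumCount(const std::vector<double>& batch) {
  SumCount partial = {0.0, 0};
  for (double value : batch) {
    partial.sum += value;
    partial.count += 1;
  }
  return partial;
}

// Final phase (hypothetical): merged (sum, count) pairs -> one average.
double FinalAvgByCount(const std::vector<SumCount>& partials) {
  SumCount total = {0.0, 0};
  for (const SumCount& partial : partials) {
    total.sum += partial.sum;
    total.count += partial.count;
  }
  // This sketch requires at least one input row.
  assert(total.count > 0);
  return total.sum / static_cast<double>(total.count);
}
```

For batches `{1, 2, 3}` and `{4, 5}`, merging the partials gives 15 / 5 = 3, whereas averaging the two per-batch averages (2 and 4.5) would give 3.25, which is why the partial phase has to return both columns.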

* [Scala] ColumnarShuffleHashJoin with Knownfloating expr

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [scala] support In (oap-project#91)

* [scala] support In

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix get ordinal for ColumnarIn

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix get ordinal in agg (oap-project#92)

a special fix for Q10
Spark normalizes float/double types when they are used as join keys

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] An attr fix in ColumnarAggregation

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Revert "[Scala] fix get ordinal in agg (oap-project#92)"

This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4.

* [Scala] Fix for Q11

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a new expression that collects the subquery result and passes it as a literal to gandiva

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] adding support for extract_year (oap-project#88)

* [Scala] adding support for extract_year

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] cast utf8/int64 to date64 first

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] support DateType for Literal

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] add support for string Contains

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] use string based comparison for datetype

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] clean up

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] null key will be skipped in Groupby Case

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add native ResultIterator support for Groupby HashAggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] ColumnarHashAggregation and ColumnarProjection Refactor

Extracted the projection code from ColumnarAggregation into a separate class,
so we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression.

Also added return-by-batch support in ColumnarAggregation, so we won't return too many rows at once,
which may result in a memory leak.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] ColumnarConditionProjection fix after Aggregation Refine

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] extractYear fix to use Int32

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add a new interface to pass selectionVector

1. add selection support to evaluator and resultIterator
2. add selectionVector support to ProbeArrays
3. fix wo/ groupby aggregate result type issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a new interface to pass selectionVector

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Using ConditionProjector to handle condition inside Join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support condition inside ColumnarJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] A workaround to skip Condition when input doesn't contain this field

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Support multiple same primary key Join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] shift groupby key hashed value then add to next one

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] support for Not (oap-project#80)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala & CPP] Fix ColumnarAggregation ResultIterator bug

Originally we used a Slice array in native code; when this array was passed to Java, the Slice configuration was lost, so we got incorrect results.
Now the array is built inside the ResultIterator Next function, and the result is correct.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] support case when (oap-project#100)

* [Scala] support case when

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix EqualTo

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix agg in case when

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] restore BinaryOperator

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] clean up

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] Cast dataType in BinaryOperator

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala & CPP] Support Outer Join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix when aggregationExpression is empty

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Wip condition join (oap-project#106)

* [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Move bindReference inside ColumnarConditionProjection

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add Native conditionedJoin

This PR aims to do runtime codegen so that we can perform a conditioned join operation:
- Add a new ConditionedShuffleArrayList implementation
- Add a new ConditionedProbeArrays implementation
- Generate a signature for the codegen func, and use the signature to check whether the lib exists
- Add NoneCondition support
- Remove the ShuffleArrayList implementation and use ConditionShuffleArrayList instead
- Remove unused Kernels and Actions

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support new conditionedJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Remove original probeArrays kernel

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [scala] support function with in operator (oap-project#107)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Use original shuffle codes here to improve performance

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Fix AvgByCount bug

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add In Support when doing codegen and forward unknown function

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix a small bug in ColumnarExpressionConverter for Like

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Move SparkColumnarPlugin to oap-native-sql folder

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Small fixes (oap-project#1184)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [DO NOT MERGE]WIP Q2 fix (oap-project#1187)

* [CPP & Scala] Fixed some codes for ConditionedShuffle

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Q2_fix done

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Last commit removed some code by mistake, fix here

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* [nativesql] fix compile against new arrow (oap-project#1189)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* Update ApacheArrowInstallation.md

* [nativesql]Wip spark rebase (oap-project#1202)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203)

* Copied Arrow Hashing to our repo so new upstream modifications won't break our builds

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Add protobuf inside native sql

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql]refactor native parquet reader/writer (oap-project#1205)

* Remove sortArraysToIndices

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql] Move Parquet Reader and Writer into nativeSql

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add a parquet reader and writer adapter

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql] Refactor and move spark side commits to nativeSql

1. move parquet reader logic to nativesql
2. move ArrowWritableColumnVector to nativesql
3. Use postRule to call RowToArrowColumnVector
4. move cpp so to jar
5. remove benchmark folder
6. update readme

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql][CPP] Use CMake to download and compile protobuf

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* ArrowDataSource for Spark (#1226)

* [oap-native-sql]Add Installation Notes (#1231)

* add InstallationNotes to README

* refine

* refine

* refine

* [NativeSql] ClassCastException if non-parquet data source is used (#1238)

* Move ArrowWritableColumnVector from org.apache to com.intel (#1243)

* [DataSource] Compilation error due to multiple source directories (#1244)

* [oap-native-sql]Wip refine protobuf install (#1230)

* [Building] refine protobuf dependency check

 - if not found, download protobuf and statically link to it
 - if found, reuse system level protobuf and dynamically link to it

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Building] check for dynamic protobuf lib only

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql][Scala] support date32 (#1225)

* [Scala] support date32

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++][Java] Support Date32 in RowToColumn

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] support date32 in unique action

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Java] fix getUTF8String on Date32

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* set C++ 2011 standard (#1236)

* [Scala] fix contain to use is_substr (#1235)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Java] fix date32 projection (#1250)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql][Scala] memory leak track and fixes (#1227)

* [NativeSql][Scala] memory leak track and fixes

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql][CPP] Another derived class should add virtual to its super destruction func

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [DataSource][Arrow] Suppress exceptions from unexpected types when pushdown filters (#1253)

* Update README googletest installation (#1251)

* [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262)

* [DataSource][Arrow] Output schema mismatch when scanning for zero data columns

* [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values

* [DataSource][Arrow] Update README.md (#1263)

* [DataSource][Arrow] Add assembly build (#1264)

* [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267)

* [oap-native-sql] Calling ColumnVectorUtils.populate(...) on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268)

* [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269)

* [DataSource][Arrow] Update README.md (#1276)

* [DataSource][Arrow] Update README.md (#1279)

* [Scala] adding IsNull support (#1256)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql] Add open permission parameter (#1266)

* add open O_CREAT permission mode

* [DataSource][Arrow] Prune pushed filters that access partition columns (#1285)

* [oap-native-sql][Scala]Adding abs support (#1273)

* support abs

* [Building] building with spark-sql from our maven repo (#1249)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288)

* [DataSource][Arrow] File descriptor leak (#1295)

* inset (#1290)

* upper (#1301)

* [oap-native-sql][CI] update travis for native sql (#1294)

* [CI] update travis for native sql

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CI] fix grammar, use openjdk8

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CI] update to use python3 env

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Doc] update readme (#1308)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* coalesce (#1306)

* [oap-native-sql][Scala]adding if support (#1307)

* add IfOperator

* add boolean type

* [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261)

* [NativeSql] ColumnarSort kernel

ColumnarSort is implemented with CodeGeneration method

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Fix compiling issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]support date32 in IN expression (#1303)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding ASF license (#1331)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
HongW2019 pushed a commit to HongW2019/gazelle_plugin that referenced this pull request Sep 2, 2021
This patch implements the following main features for Native SQL engine:

- ColumnarExchange support
- runtime codegen for ColumnarShuffledHashJoin/ColumnarSort
- Configurable batch size for Arrow Data Source
- Support more Functions from TPCDS queries

Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Rong Ma <rong.ma@intel.com>
Co-authored-by: Jiayi Chen <Jiayi.chen@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Rui Mo <rui.mo@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Binwei Yang <binwei.yang@intel.com>

=================
* [C++]Add jni_common.h

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add splitArray function

Aims to add a function that splits one Array into multiple arrays by distinct key.
The code is done and runnable with correct results; will try bigger input.

Reminder: currently we can only use one array as the splitter.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
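
As a rough illustration of the splitting idea described above (not the actual splitArray kernel, which works on Arrow arrays), the sketch below splits a plain `std::vector` into one vector per distinct key. The name `SplitByKey` and the use of `std::vector<int64_t>` are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not from the repository): splits `values` into one
// output vector per distinct entry in `keys`. Groups are ordered by the
// first appearance of each key; only a single key array is supported.
std::vector<std::vector<int64_t>> SplitByKey(const std::vector<int64_t>& keys,
                                             const std::vector<int64_t>& values) {
  assert(keys.size() == values.size());
  std::unordered_map<int64_t, size_t> key_to_group;  // key -> output index
  std::vector<std::vector<int64_t>> groups;
  for (size_t i = 0; i < keys.size(); ++i) {
    auto it = key_to_group.find(keys[i]);
    if (it == key_to_group.end()) {
      // First time this key is seen: open a new output group for it.
      it = key_to_group.emplace(keys[i], groups.size()).first;
      groups.emplace_back();
    }
    groups[it->second].push_back(values[i]);
  }
  return groups;
}
```

Calling `SplitByKey` with keys `{1, 2, 1, 3, 2}` and values `{10, 20, 30, 40, 50}` yields three groups, one per distinct key.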

* [C++] Bug fix when only splitting one array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] refactoring codes to support a visitor chain

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* add CMakeLists.txt

* [C++] Change splitArray to only use one loop for all arrays

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Noticed a kernel_ext bug, fixed here; also refined the code a bit

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions.

Originally, arrow initializes a hash_table inside the DictEncode and ValueCounts functions,
so multiple arrays can't be processed against the same hash_table. By changing the arrow
code, we are now able to pass in a long-lived hash_table and get a unified index for all arrays.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add new interface "finish"

This interface is designed for evaluation over multiple recordBatches; the output is generated when finish is called.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] add a new appendArrayToBatch function

Using this function, we can build a new recordbatch based on multiple recordBatch input, then we are able to make a final aggregate result for all.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuilder

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [c++] Only use dict when splitting array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update ApacheArrowInstallation.md

* [C++] support groupby aggregate in cpp level

1. added EncodeArray kernel
2. added a finish function mechanism
3. added appendToCache functions
4. splitArrayList uses indices instead of cache the whole list

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
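
The groupby steps listed above can be sketched roughly as an encode step followed by a per-group aggregation. This is only a minimal sketch on plain `std::vector` data; `EncodeKeys` and `SumByGroup` are hypothetical names and do not correspond to the EncodeArray kernel's real interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical analogue of an encode step: maps each key to a dense group id
// (0, 1, 2, ...) in order of first appearance and reports the group count.
std::vector<size_t> EncodeKeys(const std::vector<int64_t>& keys, size_t* num_groups) {
  std::unordered_map<int64_t, size_t> dict;
  std::vector<size_t> ids;
  ids.reserve(keys.size());
  for (int64_t k : keys) {
    // emplace only inserts when the key is new; dict.size() is the next free id.
    auto it = dict.emplace(k, dict.size()).first;
    ids.push_back(it->second);
  }
  *num_groups = dict.size();
  return ids;
}

// Aggregates by indexing with the group ids instead of caching whole lists.
std::vector<int64_t> SumByGroup(const std::vector<size_t>& ids,
                                const std::vector<int64_t>& values,
                                size_t num_groups) {
  assert(ids.size() == values.size());
  std::vector<int64_t> sums(num_groups, 0);
  for (size_t i = 0; i < ids.size(); ++i) {
    assert(ids[i] < num_groups);
    sums[ids[i]] += values[i];
  }
  return sums;
}
```

Knowing the group count up front lets the aggregation allocate its output once, which is in the spirit of the "max group id" optimization mentioned in a later commit.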

* [Scala/Java] Added support for groupBy aggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add jni support for finish function

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Fix on groupby aggregate feature

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Move merge multiple groupby batch into one implementation to CPP

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Continually optimize groupby hash aggregation by using action

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix DataType issue for HashAggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Optimize GroupBy HashAggregate performance

1. Used hash_table key column as uniqueAction input, so uniqueAction won't need to calculate each time
2. Set max group id at the beginning of row evaluation, so each action's evaluate only needs to do the real evaluation work.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Use AppendValues to build Array

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change Makefile using O3, which significantly improves performance

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add Spark Metrics for HashAggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Using MinMax in SplitArrayListWith Action to get max group id

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add a new way to directly access data instead of using Array API
[C++] Using inline lambda instead of using function call in action

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Refactor arrow compute to simplify the workpath of calling Eval()

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix bug, ColumnarBatch was not closed before

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Disable ColumnarShuffle, it looks like it causes an OOM issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change to use cmake instead of makefile

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Add GoogleTests

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add a new unittest and macroed add_test

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change to use groupby_aggregate.h Group func

Rebased intel arrow to the latest arrow commit, reverted our changes to hash.h, and moved the group function to groupby_aggregate.h

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Sort support

Sorting requires two kernels: sortArraysToIndices builds an indices array, and the remaining arrays then use these sorted_indices to do a shuffle.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
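
The two-step sort described above can be sketched on plain vectors as follows. This is an illustration under assumptions, not the kernels' actual code: `SortToIndices` and `Shuffle` are hypothetical stand-ins for the index-building kernel and the shuffle kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Step 1 (hypothetical analogue of sortArraysToIndices): compute the row
// order that sorts `key` ascending, without moving any data.
std::vector<size_t> SortToIndices(const std::vector<int64_t>& key) {
  std::vector<size_t> indices(key.size());
  std::iota(indices.begin(), indices.end(), 0);
  std::stable_sort(indices.begin(), indices.end(),
                   [&key](size_t a, size_t b) { return key[a] < key[b]; });
  return indices;
}

// Step 2: reorder any column (key or payload) with the shared indices.
template <typename T>
std::vector<T> Shuffle(const std::vector<T>& column,
                       const std::vector<size_t>& indices) {
  std::vector<T> out;
  out.reserve(indices.size());
  for (size_t idx : indices) {
    assert(idx < column.size());
    out.push_back(column[idx]);
  }
  return out;
}
```

Separating the two steps means the comparison work is done once on the key column, and every other column only pays for a gather.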

* [C++] Pass col type when make SplitArrayListWithAction Kernel

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* add AppendArrayKernel

This patch adds AppendArrayKernel support.

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] Bug fix, noticed we didn't use the java builder inside this project

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Fix bug when there is finish Func in expr

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++/Scala] Refine the way of extract hash aggregate input expression

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Change unittest to use action_dono, so we can get multiple return_type

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Enable ColumnarSort

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add benchmark

Benchmark will read a parquet file from local and do evaluation upon this file

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] ShuffleArrayList performance optimization

Use builder directly instead of using array_builder_impl.h

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++]Add big scale test, batch size is 5176

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [C++] Add Iterator<RecordBatch> as Finish Return

Currently, we only support returning std::vector<RecordBatch>; this adds a new way of returning an iterator, to make it more extensible

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch>

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [BUG FIX] Fix uninitialized row_id bug

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [JAVA] adding missing BatchIterator file

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] allow to operate on Long and Double type

Both Gandiva and Arrow Compute support these two types now.

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding vhashjoin support

This patch adds vhashjoin support w/ below major change:
- Allow to set member set for kernels
- Adding Take&NTake kernels
- Spark columnar plugin for ShuffledHashJoinExec(turned off now)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] Bug fix in ColumnarAggregate when some column will be trimmed

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Implement this feature with two methods:

1. Using utf8 to merge keys -> ConcatArrayKernel
2. use gandiva to do hash + add -> HashAggrArrayKernel

Now we chose to use gandiva

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]Some fixing to support ColumnarAggregationWithTwoKeys

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add ColumnarBatchScan Support

With this, we can turn WSCG off when testing columnar-based processing

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a new ColumnarConditionProjector Operator

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP & Scala] Add descending and null first support for ColumnarSort

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Add an alternative ColumnarJoin implementation (oap-project#71)

* [CPP]ShuffleArrayList kernel fix when null exists

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add Join Benchmark

We used tpch lineitem and order table to test join, which contains 800+ batches

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] adding a new method for ColumnarJoin

Add a new kernel called probeArrays, which takes multiple input arrays one by one and then probes their primary keys with another set of arrays.
Also refined shuffleArrayListKernel; by combining the two, we can join batches from two tables together.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
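
A build-then-probe join of the kind described above can be sketched like this. It is a simplified illustration on single `int64_t` key columns; `JoinIndices` and `ProbeInnerJoin` are hypothetical names, and the real probeArrays kernel operates on Arrow arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Matching row positions from each side; feeding these into a shuffle/take
// step would materialize the joined columns.
struct JoinIndices {
  std::vector<size_t> build_rows;
  std::vector<size_t> probe_rows;
};

// Hypothetical inner-join probe: hash the build-side keys, then look up
// every probe-side key and emit one index pair per match.
JoinIndices ProbeInnerJoin(const std::vector<int64_t>& build_keys,
                           const std::vector<int64_t>& probe_keys) {
  std::unordered_map<int64_t, std::vector<size_t>> table;
  for (size_t i = 0; i < build_keys.size(); ++i) {
    table[build_keys[i]].push_back(i);
  }
  JoinIndices out;
  for (size_t j = 0; j < probe_keys.size(); ++j) {
    auto it = table.find(probe_keys[j]);
    if (it == table.end()) continue;  // no match: dropped in an inner join
    for (size_t i : it->second) {
      out.build_rows.push_back(i);
      out.probe_rows.push_back(j);
    }
  }
  assert(out.build_rows.size() == out.probe_rows.size());
  return out;
}
```

Returning index pairs rather than joined rows keeps the probe independent of how many payload columns each table carries.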

* [JNI] Add jni support for using ResultIterator.Process

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Spark Columnar Support for ShuffledHashJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add New Unittest and BenchmarkTest for InnerJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Refactor current Join codes and support both right Join and InnerJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]ColumnarShuffledHashJoin Refine for InnerJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala]ColumnarAggregate fix for Q4

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* fix cond projector without condition (oap-project#75)

should project with resultSchema

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix string support for columnar projection (oap-project#76)

* [Scala] fix string support for columnar projection

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix StringType convert

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] skip projector evaluate if filter has 0 row result

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix possible memory leak

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Add a new Action called CountLiterAction

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support CountLiteral

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Wip avg support (oap-project#79)

* [CPP] Enabled groupby avg, AvgByCount and SumCount kernel

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [JNI] Add a new interface called setReturnFields

This interface is used to set result Schema when some of expressions return more than one fields and we can't use current gandiva expression to describe the schema.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] enable groupby avg

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Rewrite Unique Action and add String Support

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Remove Concat Kernel and Action and some code refinement

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] String fix

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* Update ApacheArrowInstallation.md

* [CPP] Multiple Key Groupby fix and optimization

Noticed that groupby with multiple keys previously returned incorrect results; this commit fixes that.
Also, if the multiple keys are all strings, they are concatenated with gandiva and hashed first, then passed to encodeArray.
This is a little faster than directly hashing and adding.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] SplitArray optimization

Move input array from lambda capture to class member, which will improve performance a lot.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support Aggregation with projection inside case (oap-project#86)

By this new fix, we are able to run unmodified TPCH Q1

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] adding support for starts_with & ends_with (oap-project#78)

* [Scala] adding support for starts_with & ends_with

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] adding support for like

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix string like support

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] support substring

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Support String in ColumnarJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] LeftSemi Join support

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Continue fix aggregate issue for Q3

Now Q3 is runnable

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Memory leak issue fixing

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP & Scala] Support multiple key join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add a new interface to get holder current size

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Refine current ConditionProjector codes

1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return
2. Fix several bugs and made input schema for condition and project more clear

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a return column size in Columnar AggregatExpression

This handles scenarios like avg, which takes one input column and returns two columns in the partial phase, then takes two input columns and returns one in the final phase. This is also a fix for Q1.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
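
The column-count asymmetry of avg described above can be sketched as a two-phase aggregate. This is a minimal sketch with hypothetical names (`SumCount`, `PartialAvg`, `FinalAvg`); it only illustrates why the partial phase produces two values per input column and the final phase collapses them back to one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Partial-phase result: one input column becomes two output values.
struct SumCount {
  double sum = 0.0;
  int64_t count = 0;
};

// Partial phase: runs per batch or partition.
SumCount PartialAvg(const std::vector<double>& values) {
  SumCount out;
  for (double v : values) {
    out.sum += v;
    ++out.count;
  }
  return out;
}

// Final phase: two inputs (sum, count) per partial, one output (the average).
double FinalAvg(const std::vector<SumCount>& partials) {
  SumCount total;
  for (const SumCount& p : partials) {
    total.sum += p.sum;
    total.count += p.count;
  }
  assert(total.count > 0);
  return total.sum / static_cast<double>(total.count);
}
```

Averaging the partial averages directly would be wrong whenever partitions differ in size, which is why sum and count travel separately between the phases.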

* [Scala] ColumnarShuffleHashJoin with Knownfloating expr

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [scala] support In (oap-project#91)

* [scala] support In

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix get ordinal for ColumnarIn

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix get ordinal in agg (oap-project#92)

a special fix for Q10
Spark normalizes float/double types when they are used as join keys

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] An attr fix in ColumnarAggregation

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Revert "[Scala] fix get ordinal in agg (oap-project#92)"

This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4.

* [Scala] Fix for Q11

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a new expression that collects the subquery result and uses it as a literal in gandiva

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] adding support for extract_year (oap-project#88)

* [Scala] adding support for extract_year

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] cast utf8/int64 to date64 first

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] support DateType for Literal

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] add support for string Contains

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] use string based comparison for datetype

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] clean up

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] null key will be skipped in Groupby Case

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add native ResultIterator support for Groupby HashAggregate

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] ColumnarHashAggregation and ColumnarProjection Refactor

Extracted current projection codes from ColumnarAggregation and made as a single class,
So we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression.

Also added return-by-batch support in ColumnarAggregation, so we won't return too many lines,
which may result in a memory leak.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] ColumnarConditionProjection fix after Aggregation Refine

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] extractYear fix to use Int32

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add a new interface to pass selectionVector

1. add selection support to evaluator and resultIterator
2. add selectionVector support to ProbeArrays
3. fix wo/ groupby aggregate result type issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Add a new interface to pass selectionVector

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Using ConditionProjector to handle condition inside Join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support condition inside ColumnarJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] A workaround to skip Condition when input doesn't contain this field

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Support multiple same primary key Join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] shift groupby key hashed value then add to next one

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
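
One way to read "shift the hashed value, then add to the next one" is the fold sketched below. The commit title does not state the shift width, so the 1-bit shift here is purely an assumption, and `CombineKeyHashes` is a hypothetical name rather than a function from the repository.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: folds per-column key hashes into one row hash by
// shifting the running value left and adding the next column's hash.
// The 1-bit shift width is an assumption; unsigned overflow wraps by design.
uint64_t CombineKeyHashes(const std::vector<uint64_t>& column_hashes) {
  assert(!column_hashes.empty());
  uint64_t combined = 0;
  for (uint64_t h : column_hashes) {
    combined = (combined << 1) + h;
  }
  return combined;
}
```

The shift matters because plain addition is commutative: without it, key pairs (a, b) and (b, a) would always land in the same group hash.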

* [Scala] support for Not (oap-project#80)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala & CPP] Fix ColumnarAggregation ResultIterator bug

Originally we used sliced arrays in native code, and when such an array was passed to Java the slice configuration was lost, so we got incorrect results.
Now we build the array inside the ResultIterator Next function, and the result is correct.

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] support case when (oap-project#100)

* [Scala] support case when

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix EqualTo

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix agg in case when

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] restore BinaryOperator

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] clean up

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] Cast dataType in BinaryOperator

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala & CPP] Support Outer Join

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix when aggregationExpression is empty

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Wip condition join (oap-project#106)

* [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Move bindReference inside ColumnarConditionProjection

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add Native conditionedJoin

This PR aims to do runtime codegen so we can perform a conditioned join operation:
- Add a new ConditionedShuffleArrayList implementation
- Add a new ConditionedProbeArrays implementation
- Generate signature for codegen func, and use signature to check if lib exists
- Add NoneCondition Support
- Remove ShuffleArrayList implementation and change to use ConditionShuffleArrayList
- Remove not in use Kernels and Actions

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Support new conditionedJoin

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Remove original probeArrays kernel

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [scala] support function with in operator (oap-project#107)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Use original shuffle codes here to improve performance

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Fix AvgByCount bug

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add In Support when doing codegen and forward unknown function

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] Fix a small bug in ColumnarExpressionConverter for Like

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Move SparkColumnarPlugin to oap-native-sql folder

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Small fixes (oap-project#1184)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185)

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [DO NOT MERGE]WIP Q2 fix (oap-project#1187)

* [CPP & Scala] Fixed some codes for ConditionedShuffle

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Q2_fix done

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Last commit removed some code by mistake, fix here

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* [nativesql] fix compile against new arrow (oap-project#1189)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* Update ApacheArrowInstallation.md

* [nativesql]Wip spark rebase (oap-project#1202)

* [nativesql] fix compile against new arrow

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] fix compile warning

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] remove unused headers

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203)

* Copied Arrow Hashing to our repo so new modifications won't break our builds

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [scala] fix spark rebasing

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CPP] Add protobuf inside native sql

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql]refactor native parquet reader/writer (oap-project#1205)

* Remove sortArraysToIndices

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql] Move Parquet Reader and Writer into nativeSql

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [CPP] Add a parquet reader and writer adapter

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql] Refactor and move spark side commits to nativeSql

1. move parquet reader logic to nativesql
2. move ArrowWritableColumnVector to nativesql
3. Use postRule to call RowToArrowColumnVector
4. move cpp so to jar
5. remove benchmark folder
6. update readme

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql][CPP] Use CMake to download and compile protobuf

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* Update README.md

* ArrowDataSource for Spark (#1226)

* [oap-native-sql]Add Installation Notes (#1231)

* add InstallationNotes to README

* refine

* refine

* refine

* [NativeSql] ClassCastException if non-parquet data source is used (#1238)

* Move ArrowWritableColumnVector from org.apache to com.intel (#1243)

* [DataSource] Compilation error due to multiple source directories (#1244)

* [oap-native-sql]Wip refine protobuf install (#1230)

* [Building] refine protobuf dependency check

 - if not found, download protobuf and statically link to it
 - if found, reuse system level protobuf and dynamically link to it

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Building] check for dynamic protobuf lib only

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql][Scala] support date32 (#1225)

* [Scala] support date32

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++][Java] Support Date32 in RowToColumn

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] support date32 in unique action

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Java] fix getUTF8String on Date32

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* set C++ 2011 standard (#1236)

* [Scala] fix contain to use is_substr (#1235)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Java] fix date32 projection (#1250)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql][Scala] memory leak track and fixes (#1227)

* [NativeSql][Scala] memory leak track and fixes

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [NativeSql][CPP] Another derived class should add virtual to its super destruction func

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [DataSource][Arrow] Suppress exceptions from unexpected types when pushdown filters (#1253)

* Update README googletest installation (#1251)

* [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262)

* [DataSource][Arrow] Output schema mismatch when scanning for zero data columns

* [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values

* [DataSource][Arrow] Update README.md (#1263)

* [DataSource][Arrow] Add assembly build (#1264)

* [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267)

* [oap-native-sql] Calling ColumnVectorUtils.populate(...) on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268)

* [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269)

* [DataSource][Arrow] Update README.md (#1276)

* [DataSource][Arrow] Update README.md (#1279)

* [Scala] adding IsNull support (#1256)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql] Add open permission parameter (#1266)

* add open O_CREAT permission mode

* [DataSource][Arrow] Prune pushed filters that access partition columns (#1285)

* [oap-native-sql][Scala]Adding abs support (#1273)

* support abs

* [Building] building with spark-sql from our maven repo (#1249)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288)

* [DataSource][Arrow] File descriptor leak (#1295)

* inset (#1290)

* upper (#1301)

* [oap-native-sql][CI] update travis for native sql (#1294)

* [CI] update travis for native sql

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CI] fix grammar, use openjdk8

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CI] update to use python3 env

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Doc] update readme (#1308)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* coalesce (#1306)

* [oap-native-sql][Scala]adding if support (#1307)

* add IfOperator

* add boolean type

* [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261)

* [NativeSql] ColumnarSort kernel

ColumnarSort is implemented with CodeGeneration method

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Fix compiling issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [Scala] support date32 in IN expression (#1303)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* adding ASF license (#1331)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [CI] update to use new oap-master branch (#1342)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql]Rewrite ColumnarShuffledHashJoin using CodeGeneration (#1324)

* [oap-native-sql] Rewrite ColumnarShuffledHashJoin using codegeneration

1. Remove unused files after we change to use codegen
2. Change to use SparseHashMap instead of arrow Hashing

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Use java.io.tmpdir or cmake build dir as codegen tmp dir

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Add copyright and change datatype in array_item_index to uint16_t

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Use add_definitions instead of add_compile_definitions

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* adding concat support (#1328)

* [oap-native-sql][Scala] refine coalesce (#1340)

* [oap-native-sql][Scala] fix null value exception for StringType and DoubleType (#1333)

* [oap-native-sql] Enable mvn package to build native libs (#1341)

* Enable mvn package to build native libs

* [oap-native-sql][Scala] fix attr errors (#1330)

* [oap-native-sql][Scala] adding round support (#1332)

* [oap-native-sql] columnar shuffle (oap-project#1212)

* [Scala/C++] columnar shuffle

* [Scala] sync with arrow-dataset

* [Scala] rebase to spark 3.1.0

* [Scala] fix & rebase to arrow 0.17

* [Java] serializer & typo

* [Scala] fix serializer & add data size SQLMetric

* [NativeSql][c++] Support date type

[NativeSql][Scala] support fall back row-based shuffle

* [NativeSql][Scala] columnar shuffle configurable

* [NativeSql][Scala] serializer reference transfer & fix decompress

[NativeSql][c++] update deprecated

* [NativeSql][Scala] fix writer write columnar batch of 0 rows

* [NativeSql][Scala] read batch num rows metrics

* [NativeSql][Scala] configurable native buffer size

* [NativeSql][c++] optimize

[Scala] ColumnarShuffleExchange filter empty batch

* [NativeSql][Scala] fix extra close

* [NativeSql][Scala] coalesce batch

* [NativeSql][c++] find boost

* [NativeSql][Scala] fallback to use parquet data source

* Revert "[NativeSql][Scala] coalesce batch"

This reverts commit 4b6929920f19769051cc899ed244761bdfb43d47.

* [NativeSql] update README.md

* [NativeSql] ci install boost

* [NativeSql] add missing ASF & reformat

* [NativeSql][Scala] remove WSCG=false

* [oap-native-sql] Add customized batch_size and tmp_dir support (#1362)

* [oap-native-sql] Add API to use customized batchSize through spark config to native

* [oap-native-sql] Initialize ColumnarPluginConfig in operators

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Add customized tmp dir through spark config

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql]Wip optimize sort (#1372)

* [oap-native-sql] Use inplace sort for single key no payload batch

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Add ska_sort for single column with payload and use std::sort in desc case

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Add third party ska-sort

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

* [oap-native-sql] Columnar shuffle I/O Use Configured Disks (#1378)

* [NativeSql] shuffle I/O using spark configuration

* [NativeSql] some cleanup

* [C++] opt hash join (#1377)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NativeSql][Scala] compression workaround (#1381)

* [DataSource][Arrow] Support reading dictionary encoded parquet values (#1376)

* [DataSource][Arrow] Support reading dictionary encoded parquet values

* CI uses Intel-bigdata/arrow/native-sql-engine-clean

* [oap-native-sql]Add ColumnarBatch Combination on Shuffle Read Side (#1370)

* [NativeSql][Scala] coalesce batch

* [NativeSql][Scala] use nano metrics

[NativeSql][Scala] add split metric to collect native split + write time, change write time metric to collect concat shuffle temp file time

* [NativeSql][Java] license & indent

* [NativeSql] rebase

* [NativeSql][c++] compress use single thread (#1394)

* [oap-native-sql][C++] extract codegen headers to nativesql_include folder (#1395)

* [C++] extract codegen headers to nativesql_include folder

So this won't conflict with zstd-jni

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [C++] support additional location of libarrow

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [DataSource][Arrow] Reserve buffer bytes from Spark off-heap executio… (#1393)

* [DataSource][Arrow] Reserve buffer bytes from Spark off-heap execution memory pool

* typo

* wip

* [oap-native-sql][Doc] update docs  (#1392)

* [Doc]wip refine doc

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Doc] refine wording and picture

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358)

* [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358)

* [Doc] fix wrong link to core arch picture (#1411)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [oap-native-sql][Scala] adding cast support (#1312)

* [Scala] adding cast support

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] disable castBIGINT

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] remove cast hack

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] fix getResultAttr in Cast

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [Scala] disable castDECIMAL and cleanup

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [DataSource][Arrow] Error when reading parquet file whose path contains character '%' (#1420)

* [DataSource][Arrow] Follow-up: A test case should be marked ignore (#1422)

* [ArrowDataSource][Scala] allow to specify batch size from Spark (#1416)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: rongma1997 <rong.ma@intel.com>
Co-authored-by: Chendi.Xue <xuechendi@gmail.com>
Co-authored-by: JiayiChen785 <59857593+JiayiChen785@users.noreply.github.com>
Co-authored-by: Hongze Zhang <mailtozhz@126.com>
Co-authored-by: Rui Mo <rui.mo@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>