[NSE-90]Refactor HashAggregateExec and CPP kernels #91

xuechendi · 2021-02-04T03:46:55Z

Fixed: #90

Update:

Verified with TPCH SF500
Verified with TPCDS SF500

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

…ernel Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

github-actions · 2021-02-04T11:49:41Z

xuechendi · 2021-02-04T12:02:34Z

@zhouyuan, verified with TPCDS and TPCH at SF500; should be OK to merge.

zhouyuan · 2021-02-04T13:38:35Z

@xuechendi 👍 I'll check this on my env.

This commit implements the Native SQL Engine for OAP. The key components are:

- Using Apache Arrow as column vector format as intermediate data among Spark operator.
- Enable Apache Arrow native readers for Parquet and other formats.
- Leverage Apache Arrow Gandiva/Compute to evaluate columnar expressions with SIMD optimizations

OAP Native SQL Engine is verified by TPC-H workload as of this commit. Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Rong Ma <rong.ma@intel.com>
Co-authored-by: Jiayi Chen <Jiayi.chen@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Rui Mo <rui.mo@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Binwei Yang <binwei.yang@intel.com>

======================

* ProjectList prepare check and type change Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* Add new ReadWriteBench Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* Add ColumnarHashAggregate support Framework done, Codes workable, saw fault result Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* Add an optimization to skip unnecessary project work Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* [Bug fix] fixed multiple cols aggregation failing issue Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* Update README.md
* Update ApacheArrowInstallation.md
* use arrow version property to record the right version Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
* integrate columnar shuffle operator and relative UT
* Update README.md
* Update ApacheArrowInstallation.md
* Update ApacheArrowInstallation.md
* Apply coding format onto current project Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* Update README.md
* [C++]Fix cpp code format Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* Add files via upload
* Add files via upload
* [C++]Refactoring current cpp codes and change return using vector<RecordBatch> Signed-off-by: Chendi Xue
<chendi.xue@intel.com> * [C++]Using google-code-style for c++ codes Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JAVA]change java jniWrapper to return a ArrowRecordBatch array Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [SCALA]Bug fixing: ColumnarShuffleExchangeExec didn't recursively pass child to next operator. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++]Remove Gandiva Protobuf in this project, added in arrow side Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [DOC] Fix installation guide after we remove gandiva_protobuf Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++]Add jni_common.h Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add splitArray function Aim to add a function to split one Array into multiple arrays with distinguish key. Codes are done, runable with correct result, will try bigger input. reminder: current we can only use one array as splitter. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Big fix when only splitting one array Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] refactoring codes to support a visitor chain Signed-off-by: Chendi Xue <chendi.xue@intel.com> * add CMakeLists.txt * [C++] Change splitArray to only use one loop for all arrays Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Noticed a kernel_ext bug, fix here, also refined a bit codes Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions. Original, arrow will initialize a hash_table inside DictEncode function and ValueCounts function which leads to multiple array can't be processed based on same hash_table, and by changing arrow code, we now be able to pass a long live hash_table get an unified index for all arrays. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add new interface "finish" This interface is designed to evaluate based on multiple recordBatch will generate a output when calling finish Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] add a new appendArrayToBatch function Using this function, we can build a new recordbatch based on multiple recordBatch input, then we are able to make a final aggregate result for all. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuider Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [c++] Only use dict when splitting array Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update ApacheArrowInstallation.md * [C++] support groupby aggregate in cpp level 1. added EncodeArray kernel 2. added a finish function mechanism 3. added appendToCache functions 4. splitArrayList uses indices instead of cache the whole list Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala/Java] Added support for groupBy aggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add jni support for finish function Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Fix on groupby aggregate feature Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Move merge multiple groupby batch into one implementation to CPP Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Continuelly optimize groupby hash aggregation by using action Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix DataType issue for HashAggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Optimize GroupBy HashAggregate performance 1. Used hash_table key column as uniqueAction input, so uniqueAction won't need to calculate each time 2. Set max group id at the beginning of row evaluation, so each action evaluate only need to do the real evaluation work. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Use AppendValues to build Array Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change Makefile using O3, which significantly improves performance Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add Spark Metrics for HashAggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Using MinMax in SplitArrayListWith Action to get max group id Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add a new way to directly access data instead of using Array API [C++] Using inline lambda instead of using function call in action Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Refactor arrow compute to simpify the workpath of calling Eval() Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix bug, ColumnarBatch was not closed before Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Disable ColumnarShuffle, looks it will cause OOM issue Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change to use cmake instead of makefile Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++]Add GoogleTests Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add a new unittest and macroed add_test Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change to use groupby_aggregate.h Group func Rebased intel arrow to lastest arrow commit and revert our changes to hash.h, and move group function to groupby_aggregate.h Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Sort support To use sort, it requires two kernels, sortArraysToIndices will cook an indices array, then rest arrays can use this sorted_indices to do a shuffle. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Pass col type when make SplitArrayListWithAction Kernel Signed-off-by: Chendi Xue <chendi.xue@intel.com> * add AppendArrayKernel This patch adds AppendArrayKernel support. Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] Bug fix, noticed we didn't use the java builder inside this project Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Fix bug when there is finish Func in expr Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Refine the way of extract hash aggregate input expression Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change unittest to use action_dono, so we can get multiple return_type Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Enable ColumnarSort Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add benchmark Benchmark will read a parquet file from local and do evaluation upon this file Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] ShuffleArrayList performance optimization Use builder directly instead of using array_builder_impl.h Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++]Add big scale test, batch size is 5176 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add Iterator<RecordBatch> as Finish Return Currently, we only supported return as std::vector<RecordBatch>, and I am thinking to add a new way of returning as iterator, to make it more extensible Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch> Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [BUG FIX] Fix uninitialized row_id bug Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JAVA] adding missing BatchIterator file Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] allow to operate on Long and Double type Both Gandiva and Arrow Compute support these two types now. 
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * adding vhashjoin support This patch adds vhashjoin support w/ below major change: - Allow to set member set for kernels - Adding Take&NTake kernels - Spark columnar plugin for ShuffledHashJoinExec(turned off now) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] Bug fix in ColumnarAggregate when some column will be trimed Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Implement this feature with two method: 1. Using utf8 to merge keys -> ConcatArrayKernel 2. use gandiva to do hash + add -> HashAggrArrayKernel Now we chose to use gandiva Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]Some fixing to support ColumnarAggregationWithTwoKeys Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add ColumnarBatchScan Support By using which, we can use WSCG off when testing columnarBased process Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a new ColumnarConditionProjector Operator Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP & Scala] Add desend and null first support for ColumnarSort Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Add an alternative ColumnarJoin implementation (oap-project#71) * [CPP]ShuffleArrayList kernel fix when null exists Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add Join Benchmark We used tpch lineitem and order table to test join, which contains 800+ batches Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] adding a new method for ColumnarJoin Add a new kernel called probeArrays, which is used to input multiple arrays one by one, then probe primary key by another sets of arrays. And also refined shuffleArrayListKernel, so by combining this two, we can join batches from two table together. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JNI] Add jni support for using ResultIterator.Process Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Spark Columnar Support for ShuffledHashJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add New Unittest and BenchmarkTest for InnerJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Refactor current Join codes and support both right Join and InnerJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]ColumnarShuffledHashJoin Refine for InnerJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]ColumnarAggregate fix for Q4 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * fix cond projector without condition (oap-project#75) should project with resultSchema Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix string support for columnar projection (oap-project#76) * [Scala] fix string support for columnar projection Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix StringType convert Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] skip projector evaluate if filter has 0 row result Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix possible memory leak Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Add a new Action call CountLiterAction Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support CountLiteral Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Wip avg support (oap-project#79) * [CPP] Enabled groupby avg, AvgByCount and SumCount kernel Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JNI] Add a new interface called setReturnFields This interface is used to set result Schema when some of expressions return more than one fields and we can't use current gandiva expression to describe the schema. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] enable groupby avg Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Rewrite Unique Action and add String Support Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Remove Concat Kernal and Action and some codes refine Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] String fix Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update README.md * Update ApacheArrowInstallation.md * [CPP] Multiple Key Groupby fix and optimization Noticed before groupby with multiple key returns incorrect result, and this commit will fix this Also if multiple keys are all string, I will concat them with gandiva and do a hash firstly then doing encodeArray. By doing which, will be a little faster then directly hash and add Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] SplitArray optimization Move input array from lambda capture to class member, which will improve performance a lot. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support Aggregation with projection inside case (oap-project#86) By this new fix, we are able to run unmodified TPCH Q1 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] adding support for starts_with & ends_with (oap-project#78) * [Scala] adding support for starts_with & ends_with Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] adding support for like Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix string like support Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] support substring Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Support String in ColumnarJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] LeftSemi Join support Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Continue fix aggregate issue for Q3 Now Q3 is runable Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Memory leak issue fixing Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP & Scala] Support 
multiple key join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add a new interface to get holder current size Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Refine current ConditionProjector codes 1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return 2. Fix several bugs and made input schema for condition and project more clear Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a return column size in Columnar AggregatExpression Since we may have one scenario like avg, which inputs one col and expected two column as return in partial phase and input two cols and expect one at final phase. Which is also a fix for Q1 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] ColumnarShuffleHashJoin with Knownfloating expr Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [scala] support In (oap-project#91) * [scala] support In Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix get ordinal for ColumnarIn Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix get ordinal in agg (oap-project#92) a special fix for Q10 Spark will do normalization when float/doubt type as join key Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] A attr fix in ColumnarAggregation Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Revert "[Scala] fix get ordinal in agg (oap-project#92)" This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4. 
* [Scala] Fix for Q11 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a new expression who will collect subquery result and as literal in gandiva Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] adding support for extract_year (oap-project#88) * [Scala] adding support for extract_year Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] cast utf8/int64 to date64 first Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] support DateType for Literal Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] add support for string Contains Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] use string based comparison for datetype Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] clean up Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] null key will be skipped in Groupby Case Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add native ResultIterator support for Groupby HashAggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] ColumnarHashAggregation and ColumnarProjection Refactor Extracted current projection codes from ColumnarAggregation and made as a single class, So we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression. Also added return by batch support in ColumnarAggregation, so we won't return too much lines which may result in memory leak. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] ColumnarConditionProjection fix after Aggregation Refine Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] extractYear fix to use Int32 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add a new interface to pass selectionVector 1. add selection support to evaluator and resultIterator 2. add selectionVector support to ProbeArrays 3. 
fix wo/ groupby aggregate result type issue Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a new interface to pass selectionVector Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Using ConditionProjector to handler condition inside Join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support condition inside ColumnarJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] A walkaround to skip Condition when input doesn't contain this field Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Support multiple same primary key Join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] shift groupby key hashed value then add to next one Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] support for Not (oap-project#80) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala & CPP] Fix ColumnarAggregation ResultIterator bug Original we used Slice array in native codes, and when we pass this array to Java, Slice configuration will be lost so we are getting incorrect result. Now we changed to build array inside ResultIterator Next function, and result is correct now. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] support case when (oap-project#100) * [Scala] support case when Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix EquealTo Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix agg in case when Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] restore BinaryOperator Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] clean up Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] Cast dataType in BinaryOperator Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala & CPP] Support Outer Join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix when aggregationExpression is empty Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Wip condition join (oap-project#106) * [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Move bindReference inside ColumnarConditionProjection Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add Native conditionedJoin This PR is aim to do runtime codegen so we can perform a conditioned join operation, Add a new ConditionedShuffleArrayList implementation Add a new ConditionedProbeArrays implementation Generate signature for codegen func, and use signature to check if lib exists Add NoneCondition Support Remove ShuffleArrayList implementation and change to use ConditionShuffleArrayList Remove not in use Kernels and Actions Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support new conditionedJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Remove original probeArrays kernel Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [scala] support function with in operator (oap-project#107) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Use original shuffle codes here to improve performance Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Fix AvgByCount bug Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add In Support 
when doing codegen and forward unknown function Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix a small bug in ColumnarExpressionConverter for Like Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Move SparkColumnarPlugin to oap-native-sql folder Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Small fixes (oap-project#1184) Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185) Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [DO NOT MERGE]WIP Q2 fix (oap-project#1187) * [CPP & Scala] Fixed some codes for ConditionedShuffle Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Q2_fix done Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Last commit invoked some mis-remove, fix here Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update README.md * [nativesql] fix compile against new arrow (oap-project#1189) * [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] fix compile warning Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] remove unused headers Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * Update ApacheArrowInstallation.md * [nativesql]Wip spark rebase (oap-project#1202) * [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] fix compile warning Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] remove unused headers Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [scala] fix spark reabasing Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203) * Copied Arrow Hashing to our repo so newly modification won't break our builds Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [scala] fix spark reabasing Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Add protobuf inside native sql Signed-off-by: Chendi Xue <chendi.xue@intel.com> Co-authored-by: Yuan Zhou 
<yuan.zhou@intel.com> * [NativeSql]refactor native parquet reader/writer (oap-project#1205) * Remove sortArraysToIndices Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql] Move Parquet Reader and Writer into nativeSql Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add a parquet reader and writer adapter Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql] Refactor and move spark side commits to nativeSql 1. move parquet reader logic to nativesql 2. move ArrowWritableColumnVector to nativesql 3. Use postRule to call RowToArrowColumnVector 4. move cpp so to jar 5. remove benchmark folder 6. update readme Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql][CPP] Use CMake to download and compile protobuf Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update README.md * ArrowDataSource for Spark (#1226) * [oap-native-sql]Add Installation Notes (#1231) * add InstallationNotes to README * refine * refine * refine * [NativeSql] ClassCastException if non-parquet data source is used (#1238) * Move ArrowWritableColumnVector from org.apache to com.intel (#1243) * [DataSource] Compilation error due to multiple source directories (#1244) * [oap-native-sql]Wip refine protobuf install (#1230) * [Building] refine protobuf dependency check - if not found, download protobuf and statically link to it - if found, reuse system level protobuf and dynamically link to it Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Building] check for dynamic protobuf lib only Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql][Scala] support date32 (#1225) * [Scala] support date32 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++][Java] Support Date32 in RowToColumn Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] support date32 in unique action Signed-off-by: Yuan Zhou 
<yuan.zhou@intel.com> * [Java] fix getUTF8String on Date32 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * set C++ 2011 standard (#1236) * [Scala] fix contain to use is_substr (#1235) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Java] fix date32 projection (#1250) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [NativeSql][Scala] memory leak track and fixes (#1227) * [NativeSql][Scala] memory leak track and fixes Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql][CPP] Another derived class should add virtual to its super destruction func Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [DataSource][Arrow] Supress exceptions from unexpected types when pushdown filters (#1253) * Update README googletest installation (#1251) * [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262) * [DataSource][Arrow] Output schema mismatch when scanning for zero data columns * [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values * [DataSource][Arrow] Update README.md (#1263) * [DataSource][Arrow] Add assembly build (#1264) * [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267) * [oap-native-sql] Calling ColumnVectorUtils.populate(...) 
on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268) * [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269) * [DataSource][Arrow] Update README.md (#1276) * [DataSource][Arrow] Update README.md (#1279) * [Scala] adding IsNull support (#1256) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql] Add open permission parameter (#1266) * add open O_CREAT permission mode * [DataSource][Arrow] Prune pushed filters that access partition columns (#1285) * [oap-native-sql][Scala]Adding abs support (#1273) * support abs * [Building] building with spark-sql from our maven repo (#1249) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288) * [DataSource][Arrow] File descriptor leak (#1295) * inset (#1290) * upper (#1301) * [oap-native-sql][CI] update travis for native sql (#1294) * [CI] update travis for native sql Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CI] fix grammar, use openjdk8 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CI] update to use python3 env Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Doc] update readme (#1308) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * coalesce (#1306) * [oap-native-sql][Scala]adding if support (#1307) * add IfOperator * add boolean type * [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261) * [NativeSql] ColumnarSort kernel ColumnarSort is implemented with CodeGeneration method Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Fix compiling issue Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]support date32 in IN epxression (#1303) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * adding ASF license (#1331) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
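
One of the entries above notes that avg takes one input column and returns two columns in the partial phase, then takes two columns and returns one in the final phase (the SumCount and AvgByCount kernels). A minimal sketch of that two-phase shape follows, assuming plain C++11 standard containers in place of Arrow arrays. `PartialAvg`, `PartialSumCount`, and `FinalAvgByCount` are hypothetical names chosen for illustration; they are not the project's actual kernels or signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical per-group partial state: the two "columns" produced by the
// partial phase of avg.
struct PartialAvg {
  double sum = 0.0;
  int64_t count = 0;
};

// Partial phase: one value column (plus a group key) in, (sum, count) per group out.
std::unordered_map<int64_t, PartialAvg> PartialSumCount(
    const std::vector<int64_t>& keys, const std::vector<double>& values) {
  assert(keys.size() == values.size());
  std::unordered_map<int64_t, PartialAvg> out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    PartialAvg& p = out[keys[i]];
    p.sum += values[i];
    p.count += 1;
  }
  return out;
}

// Final phase: (sum, count) pairs from several partial results in,
// one avg value per group out.
std::unordered_map<int64_t, double> FinalAvgByCount(
    const std::vector<std::unordered_map<int64_t, PartialAvg>>& partials) {
  std::unordered_map<int64_t, PartialAvg> merged;
  for (const auto& part : partials) {
    for (const auto& kv : part) {
      merged[kv.first].sum += kv.second.sum;
      merged[kv.first].count += kv.second.count;
    }
  }
  std::unordered_map<int64_t, double> out;
  for (const auto& kv : merged) {
    if (kv.second.count > 0) {
      out[kv.first] = kv.second.sum / static_cast<double>(kv.second.count);
    }
  }
  return out;
}
```

Carrying the count alongside the sum is what lets partial results from different batches or tasks be merged exactly; averaging per-batch averages would weight small batches too heavily.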

This patch implements below main features for Native SQL engine:

- ColumnarExchange support
- runtime codegen for ColumnarShuffledHashJoin/ColumnarSort
- Configurable batch size for Arrow Data Source
- Support more Functions from TPCDS queries

Please refer to the detailed guide on how to install and test.

Co-authored-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Rong Ma <rong.ma@intel.com>
Co-authored-by: Jiayi Chen <Jiayi.chen@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: Rui Mo <rui.mo@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Binwei Yang <binwei.yang@intel.com>

=================

* [C++]Add jni_common.h Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* [C++] Add splitArray function Aim to add a function to split one Array into multiple arrays with distinguish key. Codes are done, runable with correct result, will try bigger input. reminder: current we can only use one array as splitter. Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* [C++] Big fix when only splitting one array Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* [C++] refactoring codes to support a visitor chain Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* add CMakeLists.txt
* [C++] Change splitArray to only use one loop for all arrays Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* [C++] Noticed a kernel_ext bug, fix here, also refined a bit codes Signed-off-by: Chendi Xue <chendi.xue@intel.com>
* [C++] based on arrow commit 868c8c6, to pass hash_table to arrow compute functions. Original, arrow will initialize a hash_table inside DictEncode function and ValueCounts function which leads to multiple array can't be processed based on same hash_table, and by changing arrow code, we now be able to pass a long live hash_table get an unified index for all arrays.
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add new interface "finish" This interface is designed to evaluate based on multiple recordBatch will generate a output when calling finish Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] add a new appendArrayToBatch function Using this function, we can build a new recordbatch based on multiple recordBatch input, then we are able to make a final aggregate result for all. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Refactor split.h using array_builder_impl.h to maintain ArrayBuider Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [c++] Only use dict when splitting array Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update ApacheArrowInstallation.md * [C++] support groupby aggregate in cpp level 1. added EncodeArray kernel 2. added a finish function mechanism 3. added appendToCache functions 4. splitArrayList uses indices instead of cache the whole list Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala/Java] Added support for groupBy aggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add jni support for finish function Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Fix on groupby aggregate feature Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Move merge multiple groupby batch into one implementation to CPP Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Continuelly optimize groupby hash aggregation by using action Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix DataType issue for HashAggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Optimize GroupBy HashAggregate performance 1. Used hash_table key column as uniqueAction input, so uniqueAction won't need to calculate each time 2. Set max group id at the beginning of row evaluation, so each action evaluate only need to do the real evaluation work. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Use AppendValues to build Array Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change Makefile using O3, which significantly improves performance Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add Spark Metrics for HashAggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Using MinMax in SplitArrayListWith Action to get max group id Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add a new way to directly access data instead of using Array API [C++] Using inline lambda instead of using function call in action Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Refactor arrow compute to simpify the workpath of calling Eval() Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix bug, ColumnarBatch was not closed before Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Disable ColumnarShuffle, looks it will cause OOM issue Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Use group when doing encodeArray and add a null check when closing ColumnarAggregation Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Bug fix, close the last columnarBatch and columnarAggregation instance Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change to use cmake instead of makefile Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++]Add GoogleTests Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add a new unittest and macroed add_test Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change to use groupby_aggregate.h Group func Rebased intel arrow to lastest arrow commit and revert our changes to hash.h, and move group function to groupby_aggregate.h Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Sort support To use sort, it requires two kernels, sortArraysToIndices will cook an indices array, then rest arrays can use this sorted_indices to do a shuffle. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Pass col type when make SplitArrayListWithAction Kernel Signed-off-by: Chendi Xue <chendi.xue@intel.com> * add AppendArrayKernel This patch adds AppendArrayKernel support. Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] Bug fix, noticed we didn't use the java builder inside this project Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Fix bug when there is finish Func in expr Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++/Scala] Refine the way of extract hash aggregate input expression Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Change unittest to use action_dono, so we can get multiple return_type Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Enable ColumnarSort Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add benchmark Benchmark will read a parquet file from local and do evaluation upon this file Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] ShuffleArrayList performance optimization Use builder directly instead of using array_builder_impl.h Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++]Add big scale test, batch size is 5176 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [C++] Add Iterator<RecordBatch> as Finish Return Currently, we only supported return as std::vector<RecordBatch>, and I am thinking to add a new way of returning as iterator, to make it more extensible Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Jni + ColumnarSorter] use ResultIterator<RecordBatch> instead of return vector<RecordBatch> Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [BUG FIX] Fix uninitialized row_id bug Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JAVA] adding missing BatchIterator file Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] allow to operate on Long and Double type Both Gandiva and Arrow Compute support these two types now. 
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * adding vhashjoin support This patch adds vhashjoin support w/ below major change: - Allow to set member set for kernels - Adding Take&NTake kernels - Spark columnar plugin for ShuffledHashJoinExec(turned off now) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] Bug fix in ColumnarAggregate when some column will be trimed Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Implement this feature with two method: 1. Using utf8 to merge keys -> ConcatArrayKernel 2. use gandiva to do hash + add -> HashAggrArrayKernel Now we chose to use gandiva Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]Some fixing to support ColumnarAggregationWithTwoKeys Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add ColumnarBatchScan Support By using which, we can use WSCG off when testing columnarBased process Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a new ColumnarConditionProjector Operator Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP & Scala] Add desend and null first support for ColumnarSort Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Rename ColumnarCondProjExec to ColumnarConditionProjectExec Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Add an alternative ColumnarJoin implementation (oap-project#71) * [CPP]ShuffleArrayList kernel fix when null exists Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add Join Benchmark We used tpch lineitem and order table to test join, which contains 800+ batches Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] adding a new method for ColumnarJoin Add a new kernel called probeArrays, which is used to input multiple arrays one by one, then probe primary key by another sets of arrays. And also refined shuffleArrayListKernel, so by combining this two, we can join batches from two table together. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JNI] Add jni support for using ResultIterator.Process Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Spark Columnar Support for ShuffledHashJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add New Unittest and BenchmarkTest for InnerJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Refactor current Join codes and support both right Join and InnerJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]ColumnarShuffledHashJoin Refine for InnerJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]ColumnarAggregate fix for Q4 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Remove JoinTime in ColumnarShuffledHashJoinExec and use one in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * fix cond projector without condition (oap-project#75) should project with resultSchema Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix string support for columnar projection (oap-project#76) * [Scala] fix string support for columnar projection Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix StringType convert Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] skip projector evaluate if filter has 0 row result Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix possible memory leak Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Add a new Action call CountLiterAction Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support CountLiteral Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Wip avg support (oap-project#79) * [CPP] Enabled groupby avg, AvgByCount and SumCount kernel Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [JNI] Add a new interface called setReturnFields This interface is used to set result Schema when some of expressions return more than one fields and we can't use current gandiva expression to describe the schema. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] enable groupby avg Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Rewrite Unique Action and add String Support Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Remove Concat Kernal and Action and some codes refine Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] String fix Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update README.md * Update ApacheArrowInstallation.md * [CPP] Multiple Key Groupby fix and optimization Noticed before groupby with multiple key returns incorrect result, and this commit will fix this Also if multiple keys are all string, I will concat them with gandiva and do a hash firstly then doing encodeArray. By doing which, will be a little faster then directly hash and add Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] SplitArray optimization Move input array from lambda capture to class member, which will improve performance a lot. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support Aggregation with projection inside case (oap-project#86) By this new fix, we are able to run unmodified TPCH Q1 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] adding support for starts_with & ends_with (oap-project#78) * [Scala] adding support for starts_with & ends_with Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] adding support for like Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix string like support Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] support substring Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Support String in ColumnarJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] LeftSemi Join support Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Continue fix aggregate issue for Q3 Now Q3 is runable Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Memory leak issue fixing Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP & Scala] Support 
multiple key join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add groupby min and max and fix a bug in ShuffleArrayList Evaluate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add a new interface to get holder current size Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Refine current ConditionProjector codes 1. Use iterator instead of map in ConditionProjector, so we can skip empty columnarBatch as return 2. Fix several bugs and made input schema for condition and project more clear Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a return column size in Columnar AggregatExpression Since we may have one scenario like avg, which inputs one col and expected two column as return in partial phase and input two cols and expect one at final phase. Which is also a fix for Q1 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] ColumnarShuffleHashJoin with Knownfloating expr Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [scala] support In (oap-project#91) * [scala] support In Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix get ordinal for ColumnarIn Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix get ordinal in agg (oap-project#92) a special fix for Q10 Spark will do normalization when float/doubt type as join key Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] A attr fix in ColumnarAggregation Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Revert "[Scala] fix get ordinal in agg (oap-project#92)" This reverts commit 9ed5992b63d7791e59a559c4902d7ca516d3e3b4. 
* [Scala] Fix for Q11 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a new expression who will collect subquery result and as literal in gandiva Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] adding support for extract_year (oap-project#88) * [Scala] adding support for extract_year Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] cast utf8/int64 to date64 first Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] support DateType for Literal Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] add support for string Contains Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] use string based comparison for datetype Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] clean up Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Refine all Aggregation function and add SumCount, AvgByCount, Min and Max support Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] null key will be skipped in Groupby Case Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add native ResultIterator support for Groupby HashAggregate Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] ColumnarHashAggregation and ColumnarProjection Refactor Extracted current projection codes from ColumnarAggregation and made as a single class, So we can apply ColumnarProjection to groupingExpression, aggregateExpression and resultExpression. Also added return by batch support in ColumnarAggregation, so we won't return too much lines which may result in memory leak. Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] ColumnarConditionProjection fix after Aggregation Refine Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] extractYear fix to use Int32 Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add a new interface to pass selectionVector 1. add selection support to evaluator and resultIterator 2. add selectionVector support to ProbeArrays 3. 
fix wo/ groupby aggregate result type issue Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Add a new interface to pass selectionVector Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Using ConditionProjector to handler condition inside Join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support condition inside ColumnarJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] A walkaround to skip Condition when input doesn't contain this field Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Support multiple same primary key Join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] shift groupby key hashed value then add to next one Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] support for Not (oap-project#80) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala & CPP] Fix ColumnarAggregation ResultIterator bug Original we used Slice array in native codes, and when we pass this array to Java, Slice configuration will be lost so we are getting incorrect result. Now we changed to build array inside ResultIterator Next function, and result is correct now. 
Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] support case when (oap-project#100) * [Scala] support case when Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix EquealTo Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix agg in case when Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] restore BinaryOperator Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] clean up Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] Cast dataType in BinaryOperator Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala & CPP] Support Outer Join Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix when aggregationExpression is empty Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Wip condition join (oap-project#106) * [Scala & CPP] Support LeftAnti Join in ColumnarShuffledHashJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Move bindReference inside ColumnarConditionProjection Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add Native conditionedJoin This PR is aim to do runtime codegen so we can perform a conditioned join operation, Add a new ConditionedShuffleArrayList implementation Add a new ConditionedProbeArrays implementation Generate signature for codegen func, and use signature to check if lib exists Add NoneCondition Support Remove ShuffleArrayList implementation and change to use ConditionShuffleArrayList Remove not in use Kernels and Actions Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Support new conditionedJoin Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Remove original probeArrays kernel Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [scala] support function with in operator (oap-project#107) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Use original shuffle codes here to improve performance Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Fix AvgByCount bug Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add In Support 
when doing codegen and forward unknown function Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala] Fix a small bug in ColumnarExpressionConverter for Like Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Move SparkColumnarPlugin to oap-native-sql folder Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Small fixes (oap-project#1184) Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Fixed a avg with groupby issue, now Q17 is correct (oap-project#1185) Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [DO NOT MERGE]WIP Q2 fix (oap-project#1187) * [CPP & Scala] Fixed some codes for ConditionedShuffle Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Q2_fix done Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Last commit invoked some mis-remove, fix here Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update README.md * [nativesql] fix compile against new arrow (oap-project#1189) * [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] fix compile warning Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] remove unused headers Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * Update ApacheArrowInstallation.md * [nativesql]Wip spark rebase (oap-project#1202) * [nativesql] fix compile against new arrow Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] fix compile warning Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] remove unused headers Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [scala] fix spark reabasing Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [NativeSql] DeCouple Gandiva protobuf and hashing dependency (oap-project#1203) * Copied Arrow Hashing to our repo so newly modification won't break our builds Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [scala] fix spark reabasing Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CPP] Add protobuf inside native sql Signed-off-by: Chendi Xue <chendi.xue@intel.com> Co-authored-by: Yuan Zhou 
<yuan.zhou@intel.com> * [NativeSql]refactor native parquet reader/writer (oap-project#1205) * Remove sortArraysToIndices Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql] Move Parquet Reader and Writer into nativeSql Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql] Add libhdfs3.so to resource, which will be copied to /hadoop dir when doing make install Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [CPP] Add a parquet reader and writer adapter Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql] Refactor and move spark side commits to nativeSql 1. move parquet reader logic to nativesql 2. move ArrowWritableColumnVector to nativesql 3. Use postRule to call RowToArrowColumnVector 4. move cpp so to jar 5. remove benchmark folder 6. update readme Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql][CPP] Use CMake to download and compile protobuf Signed-off-by: Chendi Xue <chendi.xue@intel.com> * Update README.md * ArrowDataSource for Spark (#1226) * [oap-native-sql]Add Installation Notes (#1231) * add InstallationNotes to README * refine * refine * refine * [NativeSql] ClassCastException if non-parquet data source is used (#1238) * Move ArrowWritableColumnVector from org.apache to com.intel (#1243) * [DataSource] Compilation error due to multiple source directories (#1244) * [oap-native-sql]Wip refine protobuf install (#1230) * [Building] refine protobuf dependency check - if not found, download protobuf and statically link to it - if found, reuse system level protobuf and dynamically link to it Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Building] check for dynamic protobuf lib only Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql][Scala] support date32 (#1225) * [Scala] support date32 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++][Java] Support Date32 in RowToColumn Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] support date32 in unique action Signed-off-by: Yuan Zhou 
<yuan.zhou@intel.com> * [Java] fix getUTF8String on Date32 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * set C++ 2011 standard (#1236) * [Scala] fix contain to use is_substr (#1235) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Java] fix date32 projection (#1250) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [NativeSql][Scala] memory leak track and fixes (#1227) * [NativeSql][Scala] memory leak track and fixes Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [NativeSql][CPP] Another derived class should add virtual to its super destruction func Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [DataSource][Arrow] Supress exceptions from unexpected types when pushdown filters (#1253) * Update README googletest installation (#1251) * [DataSource][Arrow] Output schema mismatch when scanning for zero dat… (#1262) * [DataSource][Arrow] Output schema mismatch when scanning for zero data columns * [DataSource][Arrow] Use ArrowWritableColumnVector to fill partition values * [DataSource][Arrow] Update README.md (#1263) * [DataSource][Arrow] Add assembly build (#1264) * [DataSource][Arrow] Download ArrowWritableColumnVector instead of having a copy (#1267) * [oap-native-sql] Calling ColumnVectorUtils.populate(...) 
on ArrowWritableColumnVector leads to UnsupportedOperationException (#1268) * [DataSource][Arrow] Source Downloading: Change to exec-maven-plugin (#1269) * [DataSource][Arrow] Update README.md (#1276) * [DataSource][Arrow] Update README.md (#1279) * [Scala] adding IsNull support (#1256) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql] Add open permission parameter (#1266) * add open O_CREAT permission mode * [DataSource][Arrow] Prune pushed filters that access partition columns (#1285) * [oap-native-sql][Scala]Adding abs support (#1273) * support abs * [Building] building with spark-sql from our maven repo (#1249) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [DataSource][Arrow] Close batch every time new batch is read to avoid possible leaks (#1288) * [DataSource][Arrow] File descriptor leak (#1295) * inset (#1290) * upper (#1301) * [oap-native-sql][CI] update travis for native sql (#1294) * [CI] update travis for native sql Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CI] fix grammar, use openjdk8 Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CI] update to use python3 env Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Doc] update readme (#1308) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * coalesce (#1306) * [oap-native-sql][Scala]adding if support (#1307) * add IfOperator * add boolean type * [oap-native-sql] Enable ColumnarSort kernel with code generation (#1261) * [NativeSql] ColumnarSort kernel ColumnarSort is implemented with CodeGeneration method Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Fix compiling issue Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [Scala]support date32 in IN epxression (#1303) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * adding ASF license (#1331) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [CI] update to use new oap-master branch (#1342) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql]Rewrite ColumnarShuffledHashJoin using CodeGeneration 
(#1324) * [oap-native-sql] Rewrite ColumnarShuffledHashJoin using codegeneration 1. Remove unused files after we change to use codegen 2. Change to use SparseHashMap instead of arrow Hashing Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Use java.io.tmpdir or cmake build dir as codegen tmp dir Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Add copyright and change datatype in array_item_index to uint16_t Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Use add_definitions instead of add_compile_definitions Signed-off-by: Chendi Xue <chendi.xue@intel.com> * adding concat support (#1328) * [oap-native-sql][Scala] refine coalesce (#1340) * [oap-native-sql][Scala] fix null value exception for StringType and DoubleType (#1333) * [oap-native-sql] Enable mvn package to build native libs (#1341) * Enable mvn package to build native libs * [oap-native-sql][Scala] fix attr errors (#1330) * [oap-native-sql][Scala] adding round support (#1332) * [oap-native-sql] columnar shuffle (oap-project#1212) * [Scala/C++] columnar shuffle * [Scala] sync with arrow-dataset * [Scala] rebase to spark 3.1.0 * [Scala] fix & rebase to arrow 0.17 * [Java] serializer & typo * [Scala] fix serializer & add data size SQLMetric * [NativeSql][c++] Support date type [NativeSql][Scala] support fall back row-based shuffle * [NatvieSql][Scala] columnar shuffle configurable * [NativeSql][Scala] serializer reference transfer & fix decompress [NativeSql][c++] update deprecated * [NativeSql][Scala] fix writer write columnar batch of 0 rows * [NativeSql][Scala] read batch num rows metrics * [NativeSql][Scala] configurable native buffer size * [NativeSql][c++] optimize [Scala] ColumnarShuffleExchange filter empty batch * [NativeSql][Scala] fix extra close * [NativeSql][Scala] coalesce batch * [NativeSql][c++] find boost * [NativeSql][Scala] fallback to use parquet data source * Revert "[NativeSql][Scala] coalesce batch" This reverts commit 
4b6929920f19769051cc899ed244761bdfb43d47. * [NativeSql] update README.md * [NativeSql] ci install boost * [NativeSql] add missing ASF & reformat * [NativeSql][Scala] remove WSCG=false * [oap-native-sql] Add customized batch_size and tmp_dir support (#1362) * [oap-native-sql] Add API to use customized batchSize through spark config to native * [oap-native-sql] Initialize ColumnarPluginConfig in operators Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Add cutomized tmp dir through spark config Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql]Wip optimize sort (#1372) * [oap-native-sql] Use inplace sort for single key no payload batch Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Add ska_sort for single column with payload and use std::sort in desc case Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Add third party ska-sort Signed-off-by: Chendi Xue <chendi.xue@intel.com> * [oap-native-sql] Columnar shuffle I/O Use Configured Disks (#1378) * [NativeSql] shuffle I/O using spark configuration * [NativeSql] some cleanup * [C++] opt hash join (#1377) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [NativeSql][Scala] compression workaround (#1381) * [DataSource][Arrow] Support reading dictionary encoded parquet values (#1376) * [DataSource][Arrow] Support reading dictionary encoded parquet values * CI uses Intel-bigdata/arrow/native-sql-engine-clean * [oap-native-sql]Add ColumnarBatch Combination on Shuffle Read Side (#1370) * [NativeSql][Scala] coalesce batch * [NativeSql][Scala] use nano metrics [NativeSql][Scala] add split metric to collect native split + write time, change write time metric to collect concat shuffle temp file time * [NativeSql][Java] license & indent * [NativeSql] rebase * [NativeSql][c++] compress use single thread (#1394) * [oap-native-sql][C++] extract codegen headers to nativesql_include folder (#1395) * [C++] extract codegen headers to nativesql_include folder So 
this won't conflict with zstd-jni Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [C++] support additional location of libarrow Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [DataSource][Arrow] Reserve buffer bytes from Spark off-heap executio… (#1393) * [DataSource][Arrow] Reserve buffer bytes from Spark off-heap execution memory pool * typo * wip * [oap-native-sql][Doc] update docs (#1392) * [Doc]wip refine doc Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Doc] refine wording and picture Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358) * [oap-native-sql][Scala] support PartialMerge mode for aggregate (#1358) * [Doc] fix wrong link to core arch picture (#1411) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [oap-native-sql][Scala] adding cast support (#1312) * [Scala] adding cast support Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] disable castBIGINT Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] remove cast hack Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] fix getResultAttr in Cast Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [Scala] disable castDECIMAL and cleanup Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> * [DataSource][Arrow] Error when reading parquet file whose path contains character '%' (#1420) * [DataSource][Arrow] Follow-up: A test case should be marked ignore (#1422) * [ArrowDataSource][Scala] allow to specify batch size from Spark (#1416) Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: rongma1997 <rong.ma@intel.com> Co-authored-by: Chendi.Xue <xuechendi@gmail.com> Co-authored-by: JiayiChen785 <59857593+JiayiChen785@users.noreply.github.com> Co-authored-by: Hongze Zhang <mailtozhz@126.com> Co-authored-by: Rui Mo <rui.mo@intel.com> Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
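
The commit log above notes that avg takes one input column and returns two columns in the partial phase, then takes two columns and returns one in the final phase (the SumCount and AvgByCount kernels). A minimal C++ sketch of that two-phase idea follows, assuming plain std containers instead of Arrow arrays. `SumCount`, `PartialSumCount`, and `FinalAvgByCount` are hypothetical names used for illustration only and are not the project's actual kernel interfaces.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical partial state for avg: running sum and row count of one group.
struct SumCount {
  double sum = 0.0;
  int64_t count = 0;
};

// One partial result: group key -> (sum, count).
using PartialResult = std::unordered_map<int64_t, SumCount>;

// Partial phase: one value column in, two columns (sum, count) out per group.
PartialResult PartialSumCount(const std::vector<int64_t>& keys,
                              const std::vector<double>& values) {
  PartialResult out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    SumCount& state = out[keys[i]];
    state.sum += values[i];
    state.count += 1;
  }
  return out;
}

// Final phase: two columns (sum, count) in, one column (avg) out per group.
std::unordered_map<int64_t, double> FinalAvgByCount(
    const std::vector<PartialResult>& partials) {
  // Merge the partial states coming from every batch or partition.
  PartialResult merged;
  for (const auto& partial : partials) {
    for (const auto& kv : partial) {
      SumCount& state = merged[kv.first];
      state.sum += kv.second.sum;
      state.count += kv.second.count;
    }
  }
  // Divide only once, after all partial sums and counts are combined.
  std::unordered_map<int64_t, double> avg;
  for (const auto& kv : merged) {
    avg[kv.first] = kv.second.sum / static_cast<double>(kv.second.count);
  }
  return avg;
}
```

Keeping sum and count separate until the final step is what lets partial results from different batches be merged without losing accuracy; averaging per batch and then averaging the averages would weight small batches incorrectly.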

xuechendi changed the title ~~[NSE-90]Refactor HashAggregateExec and CPP kernels~~ [DNM][NSE-90]Refactor HashAggregateExec and CPP kernels Feb 4, 2021

xuechendi added 3 commits February 4, 2021 17:49

Refactor HashAggregate scala codes

8636deb

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

refactor cpp codes to handle w/wo Groupby aggregate in HashAggregateK…

3ab2ed5

…ernel Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Fix issue when spark only count numRows with no input

8222067

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the wip_standalone_hashaggr_1 branch from 2de4c13 to 8222067 Compare February 4, 2021 10:33

xuechendi changed the title ~~[DNM][NSE-90]Refactor HashAggregateExec and CPP kernels~~ [NSE-90]Refactor HashAggregateExec and CPP kernels Feb 4, 2021

xuechendi merged commit a314f08 into oap-project:master Feb 8, 2021
