[NSE-317]fix columnar cache #346

xuechendi · 2021-06-01T08:37:18Z

This PR aims to fix the current UT issues related to the RDD cache.

For tests without the Arrow serializer, we now avoid using ColumnarInMemoryTableScan.
For tests with the Arrow serializer whose cachedPlan does not support columnar, ArrowSerializer now supports ConvertInternalRowToCachedBatch.
For tests with the Arrow serializer whose cachedPlan also supports columnar, we use Arrow for fast caching.
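
The three cases above can be sketched as a small decision function. This is only an illustration of the description, not the plugin's actual code: `CachePath`, `choose_cache_path`, and both parameter names are hypothetical.

```python
from enum import Enum


class CachePath(Enum):
    # Hypothetical labels for the three cases in the PR description.
    ROW_BASED = "row-based cache, columnar in-memory scan avoided"
    ROW_TO_ARROW = "rows converted to Arrow cached batches"
    ARROW_FAST = "columnar batches cached directly with Arrow"


def choose_cache_path(arrow_serializer_enabled: bool,
                      cached_plan_supports_columnar: bool) -> CachePath:
    """Pick a cache path from the two conditions named in the description."""
    if not arrow_serializer_enabled:
        # Case 1: no Arrow serializer, so stay on the row-based path.
        return CachePath.ROW_BASED
    if not cached_plan_supports_columnar:
        # Case 2: Arrow serializer on, but the cached plan produces rows.
        return CachePath.ROW_TO_ARROW
    # Case 3: Arrow serializer on and the cached plan is columnar.
    return CachePath.ARROW_FAST
```

For example, `choose_cache_path(True, False)` selects the row-to-Arrow conversion path.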

TODO:
Backport the Spark PR for manually closing cached blocks: https://issues.apache.org/jira/browse/SPARK-35396
Test with our Jupyter tests (only UT-tested so far).
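
To illustrate why manual close matters for off-heap cached batches, here is a toy sketch of a store that closes entries explicitly when it evicts them instead of waiting for garbage collection. `ClosableEntry` and `TinyMemoryStore` are hypothetical names for illustration and do not mirror Spark's MemoryStore API.

```python
from collections import OrderedDict


class ClosableEntry:
    """Stands in for a cached batch that owns off-heap memory."""

    def __init__(self, name: str, size: int):
        self.name = name
        self.size = size
        self.closed = False

    def close(self) -> None:
        # A real entry would release its native buffers here.
        self.closed = True


class TinyMemoryStore:
    """Toy store: evicts oldest entries first and closes them explicitly."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.entries = OrderedDict()  # insertion order is eviction order

    def put(self, entry: ClosableEntry) -> None:
        # Assumes entry names are unique; evict until the new entry fits
        # or the store is empty.
        while self.entries and self.used + entry.size > self.capacity:
            _, victim = self.entries.popitem(last=False)
            self.used -= victim.size
            victim.close()  # manual release instead of relying on GC
        self.entries[entry.name] = entry
        self.used += entry.size
```

Without the explicit `close()` call, the off-heap memory held by an evicted entry would only be freed whenever the garbage collector happened to finalize it.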

Fixed: #317

github-actions · 2021-06-01T08:37:34Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/native-sql-engine/issues

Then could you also rename commit message and pull request title in the following format?

[NSE-${ISSUES_ID}] ${detailed message}

See also:

Other pull requests

xuechendi · 2021-06-01T08:37:35Z

@rui-mo, please take a look.

rui-mo · 2021-06-01T10:03:54Z

Verified on my env. This PR cleans up my test code and refreshes the UT: xuechendi#3

xuechendi · 2021-07-01T00:30:27Z

@zhouyuan, this should be mergeable.

from pyspark.sql import SparkSession

# Jars built locally from the native-sql-engine repo
native_sql_path = "/mnt/nvme2/chendi/intel-bigdata/OAP/native-sql-engine/native-sql-engine/core/target/spark-columnar-core-1.2.0-snapshot-jar-with-dependencies.jar"
native_arrow_datasource_path = "/mnt/nvme2/chendi/intel-bigdata/OAP/native-sql-engine/arrow-data-source/standard/target/spark-arrow-datasource-standard-1.2.0-snapshot-jar-with-dependencies.jar"
spark = SparkSession.builder.master('yarn')\
        .appName("Recsys2021_data_process")\
        .config("spark.executorEnv.LD_LIBRARY_PATH", "/usr/local/lib64/")\
        .config("spark.driver.extraClassPath",
                f"{native_sql_path}:{native_arrow_datasource_path}")\
        .config("spark.executor.extraClassPath",
                f"{native_sql_path}:{native_arrow_datasource_path}")\
        .config("spark.sql.extensions", "com.intel.oap.ColumnarPlugin")\
        .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")\
        .config("spark.sql.cache.serializer", "org.apache.spark.sql.execution.ArrowColumnarCachedBatchSerializer")\
        .config("spark.executor.memory", "10g")\
        .config("spark.executor.memoryOverhead", "16g")\
        .config("spark.memory.offHeap.enabled", "true")\
        .config("spark.memory.offHeap.size", "12G")\
        .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=25G")\
        .getOrCreate()

Test: cache the join result, then run an aggregate.

df = spark.read.format('arrow').load("/recsys2021_0608")
dict_df = spark.read.parquet("/recsys2021_0608_processed/recsys_dicts/language")
df = df.select("tweet_id", "language", "tweet_timestamp", "engaged_with_user_id", "engaging_user_id")
df = df.join(dict_df.withColumnRenamed('dict_col', 'language'), 'language', 'left')
df.cache()
df.groupby('dict_col_id', 'language').count().show()
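
To check which cache scan a query like the one above ends up using, one option is to look for the scan node in the physical plan text (for example, the output printed by `explain()`). The helper below is a hypothetical sketch; the node name `ColumnarInMemoryTableScan` is taken from this PR's description and its exact spelling in the plan output is an assumption.

```python
def uses_columnar_cache_scan(plan_text: str) -> bool:
    """Return True if any plan line names the columnar in-memory scan node."""
    # Node name assumed from the PR description, not verified against
    # the plugin's actual plan output.
    return any("ColumnarInMemoryTableScan" in line
               for line in plan_text.splitlines())
```

A plan that only contains the plain `InMemoryTableScan` node would make this return `False`, which is the expected outcome when the Arrow serializer is not enabled.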

rui-mo · 2021-07-01T02:09:06Z

native-sql-engine/core/src/test/scala/org/apache/spark/sql/travis/TravisCachedTableSuite.scala

+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.travis


Hi @xuechendi, we have renamed the package "travis" to "nativesql". Could you also move this file into the nativesql folder?
The file should also be renamed to NativeCachedTableSuite.scala.

github-actions · 2021-07-06T10:40:16Z

#317

xuechendi force-pushed the wip_columnar_cache_fix_for_3.1.1 branch 2 times, most recently from 49d4e03 to acc178a on June 30, 2021 09:29

rui-mo reviewed Jul 1, 2021

xuechendi and others added 10 commits July 6, 2021 13:29

Remove legacy copied file to align with 3.1.1

78b9cfb

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Add kyro support for ArrowCachedBatch and remove Cleaner simply use release function

cd4765b

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

align with on-going manual close PR

7c0963c

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Fallback when ArrowColumnarCachedBatchSerializer is not enabled

e0c6def

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Fixed issue when ArrowSerializer enabled while cache input is rowBased

a858452

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

remove unnecessary change and refresh ut

7cdc0de

backport manual native entry close in memorystore

60c460f

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Move ColumnarInMemoryRelation to spark package, and hard code derializedCacheEntry to OffHeap

a0b738d

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Fix travis for package change

4fafa2b

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

Move UT to nativeSql folder

7fb4ec3

Signed-off-by: Chendi Xue <chendi.xue@intel.com>

xuechendi force-pushed the wip_columnar_cache_fix_for_3.1.1 branch from 7c115dc to 7fb4ec3 on July 6, 2021 06:29

xuechendi changed the title ~~[DNM][NSE-317]Wip columnar cache fix for 3.1.1~~ [NSE-317]Wip columnar cache fix for 3.1.1 Jul 6, 2021

zhouyuan changed the title ~~[NSE-317]Wip columnar cache fix for 3.1.1~~ [NSE-317]fix columnar cache Jul 7, 2021

zhouyuan merged commit b584b08 into oap-project:master Jul 7, 2021
