This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-461]columnar shuffle support for ArrayType #496

Merged

xuechendi
Collaborator

Fixed: #461

@github-actions

#461

@xuechendi
Collaborator Author

Currently, the code only works when shuffle compression is disabled.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.master('local[*]')\
    .appName("udf_column")\
    .config("spark.sql.broadcastTimeout", "7200")\
    .config("spark.cleaner.periodicGC.interval", "10min")\
    .config("spark.driver.extraClassPath",
            f"{native_sql_path}:{native_arrow_datasource_path}")\
    .config("spark.sql.extensions", "com.intel.oap.ColumnarPlugin, com.intel.oap.spark.sql.ArrowWriteExtension")\
    .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")\
    .config("spark.driver.memory", "200G")\
    .config("spark.driver.memoryOverhead", "300G")\
    .config("spark.memory.offHeap.size", "300G")\
    .config("spark.oap.sql.columnar.arrowudf", "true")\
    .getOrCreate()
    #.config("spark.shuffle.compress", "false")\

df = spark.read.format("arrow").load(path_prefix + original_folder)
df = df.filter("engaged_with_user_id is not null")
df = df.groupby('engaged_with_user_id').agg(f.collect_list('present_media').alias('posted_tweet_types'), f.first('language').alias('language'))
df = df.repartition(10)
df.show(truncate=False)
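
Since the code reportedly works only with compression disabled, the shuffle settings could be gathered in one place so that the switch is explicit. This is a minimal sketch, not part of the PR: `SHUFFLE_CONF` and `apply_conf` are hypothetical names, and it assumes that `spark.shuffle.compress` (the line commented out above) is the setting in question.

```python
# Hypothetical helper, not part of this PR: keep the shuffle-related settings
# together so compression is explicitly off for the columnar shuffle path.
SHUFFLE_CONF = {
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    # Assumption: this is the switch the commented-out line above refers to.
    "spark.shuffle.compress": "false",
}


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark available, this would be used as `spark = apply_conf(SparkSession.builder.master('local[*]'), SHUFFLE_CONF).getOrCreate()`, with the remaining `.config(...)` calls chained as in the snippet above.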


@xuechendi xuechendi force-pushed the wip_ColumnarShuffleSplitting_for_arr branch 2 times, most recently from eacfd38 to 926e9fb Compare September 1, 2021 07:12
@xuechendi xuechendi force-pushed the wip_ColumnarShuffleSplitting_for_arr branch from 926e9fb to 81f7278 Compare September 2, 2021 00:37
@xuechendi xuechendi changed the title [NSE-461][DNM]columnar shuffle support for ArrayType [NSE-461]columnar shuffle support for ArrayType Sep 2, 2021
@xuechendi xuechendi force-pushed the wip_ColumnarShuffleSplitting_for_arr branch 4 times, most recently from 364550d to 3969657 Compare September 8, 2021 06:59
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the wip_ColumnarShuffleSplitting_for_arr branch from 3969657 to 6c9e746 Compare September 8, 2021 09:03
@xuechendi xuechendi merged commit 6f3041e into oap-project:master Sep 10, 2021

Successfully merging this pull request may close these issues.

Support ArrayType in Gazelle