Async write support for ORC #11865

jihoonson · 2024-12-11T21:49:15Z

Follow up to #11730.

This PR adds async write support for the ORC format, along with an integration test for it. It is marked as a draft because it depends on #11855.

jihoonson · 2024-12-11T21:50:28Z

build

jihoonson · 2024-12-17T21:26:42Z

build

Signed-off-by: Jihoon Son <ghoonson@gmail.com>

jihoonson · 2024-12-18T19:47:42Z

build

integration_tests/src/main/python/orc_write_test.py

abellina · 2024-12-18T19:50:52Z

Any indication that this is a performance improvement for Orc? I assume so, but asking if you have data.

revans2

Looks good to me, but want to hear about the coalesce question from @abellina

tools/generated_files/351/supportedExprs.csv

mythrocks

LGTM, barring the accidental inclusion of operatorsScore.csv and supportedExprs.csv.

jihoonson · 2024-12-19T01:11:46Z

> Any indication that this is a performance improvement for Orc? I assume so, but asking if you have data.

@abellina I haven't run any performance tests yet, but I agree they would be useful. Unfortunately, I'm tied up with other work right now. Do you mind if I run some benchmarks later and post the results here?

jihoonson · 2024-12-19T01:12:49Z

build

jihoonson · 2024-12-19T18:39:59Z

integration_tests/src/main/python/orc_write_test.py

+    gen_list = [('_c' + str(i), gen) for i, gen in enumerate(orc_gen)]
+    assert_gpu_and_cpu_writes_are_equal_collect(
+        lambda spark, path: gen_df(spark, gen_list, length=num_rows).write.orc(path),
+        lambda spark, path: spark.read.orc(path).orderBy([('_c' + str(i)) for i in range(num_cols)]),


It seems that the GPU and CPU ORC readers can return the same rows in different orders, so I added an orderBy to make the test deterministic.
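
For illustration only, here is a minimal pure-Python sketch of why sorting on every column makes the comparison independent of reader output order. It does not use Spark or the repository's assert helpers; `sort_rows_by_all_columns` and the sample rows are hypothetical names and data introduced for this example, and placing nulls first is an assumption chosen to resemble an ascending sort.

```python
def sort_rows_by_all_columns(rows):
    """Sort rows on every column, left to right.

    Each value is wrapped as (is_not_none, value) so that None sorts first
    and is never compared directly against a non-None value.
    """
    return sorted(rows, key=lambda row: tuple((v is not None, v) for v in row))


# Hypothetical output of two readers: same rows, different order.
cpu_rows = [(1, "a"), (2, None), (None, "c")]
gpu_rows = [(None, "c"), (1, "a"), (2, None)]

# A direct comparison is order-sensitive and fails here.
assert cpu_rows != gpu_rows

# After sorting on all columns, both sides line up row for row.
assert sort_rows_by_all_columns(cpu_rows) == sort_rows_by_all_columns(gpu_rows)
```

Sorting on all columns, rather than a subset, avoids ties whose relative order could still differ between the two sides.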

jihoonson · 2024-12-19T18:40:09Z

build

jihoonson · 2024-12-20T01:08:39Z

The test failure is filed in #11897.

jihoonson · 2024-12-20T04:44:44Z

build

jihoonson changed the title ~~Async orc write~~ Async write support for ORC Dec 11, 2024

jihoonson force-pushed the async-orc-write branch from 8d4a241 to 3b01026 December 11, 2024 21:50

jihoonson marked this pull request as ready for review December 14, 2024 01:18

jihoonson added 2 commits December 18, 2024 11:38

Async write support for ORC writer

5f23f4a

Signed-off-by: Jihoon Son <ghoonson@gmail.com>

doc change

a4b1f65

jihoonson force-pushed the async-orc-write branch from 99d0c24 to a4b1f65 December 18, 2024 19:47

abellina reviewed Dec 18, 2024

integration_tests/src/main/python/orc_write_test.py

revans2 reviewed Dec 18, 2024

mythrocks reviewed Dec 18, 2024

tools/generated_files/351/supportedExprs.csv

mythrocks reviewed Dec 18, 2024

jihoonson added 2 commits December 18, 2024 17:04

remove unnecessary coalesce in the tests

9a9cc48

revert unrelated change

ed3cdc0

sort results

942821f

jihoonson commented Dec 19, 2024

jihoonson requested review from abellina, revans2 and mythrocks December 20, 2024 17:31

abellina approved these changes Dec 20, 2024

revans2 approved these changes Dec 20, 2024
