[Task]: Spark runner flatMap output should not be required to fit in the memory #23852

JozoVilcek · 2022-10-26T14:14:18Z

What needs to happen?

Currently, on the Spark runner, if a single processElement call produces multiple output elements, they all need to fit in memory [1]. This is problematic for, e.g., ParquetIO, which uses a DoFn instead of Source<>-based reads and lets the reader push all elements to the output from inside the DoFn. The same happens with JdbcIO, as discussed in [2].

The goal is to overcome this constraint and allow a DoFn to produce large output on the Spark runner.

[1] https://github.com/apache/beam/blob/v2.39.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkProcessContext.java#L125

[2] https://www.mail-archive.com/dev@beam.apache.org/msg16806.html
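
To make the constraint concrete, here is a minimal sketch in plain Java of one general way to avoid buffering all output of a single call: run the producer on a background thread and hand elements to the consumer through a bounded queue, so that at most `capacity` elements are in memory at once. This is an illustration only, not the Beam or Spark runner code. The class `BoundedOutputSketch` and the method `stream` are hypothetical names, and error propagation, cancellation, and null elements are left out.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical illustration; none of these names come from Beam.
public class BoundedOutputSketch {
  // Sentinel marking the end of the output; never handed to the consumer.
  private static final Object DONE = new Object();

  /**
   * Runs the producer on a background thread and returns an iterator over
   * what it emits. The producer blocks once `capacity` elements are waiting.
   */
  public static <T> Iterator<T> stream(Consumer<Consumer<T>> producer, int capacity) {
    BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);
    Thread worker = new Thread(() -> {
      try {
        producer.accept(element -> put(queue, element));
      } finally {
        put(queue, DONE); // always signal completion so the consumer can stop
      }
    });
    worker.setDaemon(true);
    worker.start();

    return new Iterator<T>() {
      private Object next; // null means "nothing fetched yet"

      @Override
      public boolean hasNext() {
        if (next == null) {
          next = take(queue);
        }
        return next != DONE;
      }

      @Override
      @SuppressWarnings("unchecked")
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        T result = (T) next;
        next = null;
        return result;
      }
    };
  }

  private static void put(BlockingQueue<Object> queue, Object element) {
    try {
      queue.put(element); // blocks while the queue is full
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static Object take(BlockingQueue<Object> queue) {
    try {
      return queue.take(); // blocks while the queue is empty
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // One "call" that emits a million elements; only 16 are buffered at a time.
    Iterator<Integer> out = BoundedOutputSketch.<Integer>stream(emit -> {
      for (int i = 0; i < 1_000_000; i++) {
        emit.accept(i);
      }
    }, 16);
    long sum = 0;
    while (out.hasNext()) {
      sum += out.next();
    }
    System.out.println(sum);
  }
}
```

The trade-off in this sketch is an extra thread and a blocking hand-off per element in exchange for a fixed memory bound. Collecting the output into a list first would be simpler, but it needs memory proportional to the output size.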

Issue Priority

Priority: 2

Issue Component

Component: runner-spark


JozoVilcek · 2022-12-28T13:38:43Z

.take-issue

JozoVilcek · 2022-12-28T18:38:32Z

E-mail thread for collecting feedback on the initial WIP implementation:
https://www.mail-archive.com/dev@beam.apache.org/msg27521.html

JozoVilcek · 2023-02-02T11:45:59Z

This was fixed for SparkRunner by adding an option to enable it via an experiment. @mosche I wonder if it makes sense to make the necessary changes for the structured streaming or portable runner as well. What do you think?

mosche · 2023-02-02T14:57:17Z

@JozoVilcek It looks like SDFs on portable pipelines are expanded using a different mechanism, though I haven't looked deeply into it, to be honest.
In any case, it makes sense to open a similar issue for the structured streaming runner 👍

JozoVilcek added awaiting triage task labels Oct 26, 2022

mosche added spark runners labels Oct 27, 2022

manuzhang added P2 and removed awaiting triage labels Oct 31, 2022

github-actions bot assigned JozoVilcek Dec 28, 2022

JozoVilcek added a commit to JozoVilcek/beam that referenced this issue Dec 28, 2022

Enable async processing for SDF on Spark runner apache#23852

895c497

JozoVilcek added a commit to JozoVilcek/beam that referenced this issue Dec 29, 2022

Enable async processing for SDF on Spark runner apache#23852

fa2019f

JozoVilcek added a commit to JozoVilcek/beam that referenced this issue Dec 30, 2022

Enable async processing for SDF on Spark runner apache#23852

ba83018

JozoVilcek added a commit to JozoVilcek/beam that referenced this issue Jan 31, 2023

Enable async processing for SDF on Spark runner apache#23852

4fda1a5

mosche closed this as completed in 01aa470 Feb 2, 2023

github-actions bot added this to the 2.46.0 Release milestone Feb 2, 2023
