[Task]: Spark runner flatMap output should not be required to fit in memory #23852
Comments
e-mail thread for collecting feedback on initial WIP implementation
This was fixed for SparkRunner by adding an option to enable it via experiment. @mosche I wonder if it makes sense to make the necessary changes for the structured streaming or portable runner as well. What do you think?
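For context, opt-in experiments on Beam runners are typically passed through the `--experiments` pipeline option. A sketch of such an invocation follows; the jar, main class, and experiment name are placeholders, since the thread does not state the actual experiment string added by the fix:

```shell
# Hypothetical invocation; <experiment-name> stands in for the actual
# experiment added to the Spark runner, which is not named in this thread.
java -cp my-pipeline-bundled.jar com.example.MyPipeline \
  --runner=SparkRunner \
  --experiments=<experiment-name>
```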
@JozoVilcek It looks like SDFs on portable pipelines are expanded using a different mechanism. Though, I haven't ever looked deeply into it, to be honest.
What needs to happen?
Currently on the Spark runner, if a single processElement call produces multiple output elements, they all need to fit in memory [1]. This is problematic e.g. for ParquetIO, which instead of Source<>-based reads uses a DoFn and lets the reader inside the DoFn push all elements to the output. Something similar happens with JdbcIO and was discussed here [2]. The goal is to overcome this constraint and allow a DoFn to produce large output on the Spark runner.
[1] https://github.com/apache/beam/blob/v2.39.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkProcessContext.java#L125
[2] https://www.mail-archive.com/dev@beam.apache.org/msg16806.html
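The constraint described in [1] can be illustrated outside Beam with a minimal plain-Java sketch (all names here are made up for the example, not runner internals): eagerly materializing one call's output requires O(n) memory, while handing elements out through a lazy iterator keeps memory constant no matter how many elements a single call emits.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustration only (plain Java, not actual Beam runner code): buffering a
// processElement call's entire output versus streaming it lazily.
public class OutputBuffering {

  // Eager: all n outputs of one "processElement"-style call are collected
  // into a list before anything downstream can consume them -> O(n) memory.
  static List<Integer> processEager(int n) {
    List<Integer> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      out.add(i);
    }
    return out;
  }

  // Lazy: outputs are produced one at a time on demand -> O(1) memory,
  // regardless of how many elements a single call emits.
  static Iterator<Integer> processLazy(int n) {
    return new Iterator<Integer>() {
      private int i = 0;
      @Override public boolean hasNext() { return i < n; }
      @Override public Integer next() { return i++; }
    };
  }

  public static void main(String[] args) {
    int n = 100_000;
    // Count the lazily produced elements without ever holding them all.
    int count = 0;
    for (Iterator<Integer> it = processLazy(n); it.hasNext(); it.next()) {
      count++;
    }
    System.out.println("lazy variant streamed " + count + " elements");
  }
}
```

The lazy shape is what a fix along these lines would need: downstream consumption interleaved with production, rather than a fully materialized collection per input element.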
Issue Priority
Priority: 2
Issue Component
Component: runner-spark