Currently on the Spark runner, if a single processElement call produces multiple output elements, they all need to fit in memory [1]. This is problematic e.g. for ParquetIO, which uses a DoFn instead of Source<>-based reads and lets the reader inside the DoFn push all elements to the output. Something similar happens with JdbcIO and was discussed here [2].
The goal is to overcome this constraint and allow a DoFn to produce large output on the Spark runner.
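As a minimal illustration of the pattern in question (a hypothetical ReadLinesFn, not the actual ParquetIO code), consider a read implemented as a DoFn, where a single input element — a file path — fans out into an arbitrarily large number of output records:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical sketch of a DoFn-based read: one input element (a file path)
// can produce millions of output lines from a single processElement call.
class ReadLinesFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String path, OutputReceiver<String> out)
      throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Before the fix, the Spark runner buffered all outputs of this call
        // in memory before handing them downstream [1].
        out.output(line);
      }
    }
  }
}
```

A Source<>-based read lets the runner pull records incrementally; with this DoFn shape, the runner itself has to cope with the unbounded fan-out.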
This was fixed for SparkRunner by adding an option to enable it via an experiment. @mosche I wonder if it makes sense to make the necessary changes for the structured streaming or portable runners as well. What do you think?
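For context on what "enable via an experiment" looks like in Beam (the experiment name below is a placeholder — this thread doesn't name the actual flag introduced by the fix), experiments can be toggled programmatically through ExperimentalOptions or with --experiments on the command line:

```java
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class EnableSparkExperiment {
  public static void main(String[] args) {
    // Equivalent to passing --experiments=<experiment-name> on the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Placeholder name: consult the SparkRunner change / release notes for the
    // actual experiment introduced by this fix.
    ExperimentalOptions.addExperiment(
        options.as(ExperimentalOptions.class), "spark-experiment-name-placeholder");
  }
}
```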
@JozoVilcek It looks like SDFs on portable pipelines are expanded using a different mechanism, though to be honest I haven't ever looked deeply into it.
In any case it makes sense to open a similar issue for the structured streaming runner 👍
[1] https://github.com/apache/beam/blob/v2.39.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkProcessContext.java#L125
[2] https://www.mail-archive.com/dev@beam.apache.org/msg16806.html
Issue Priority
Priority: 2
Issue Component
Component: runner-spark