Initial DaskRunner for Beam #22421
Conversation
Codecov Report
@@ Coverage Diff @@
## master #22421 +/- ##
==========================================
- Coverage 73.35% 73.21% -0.15%
==========================================
Files 719 728 +9
Lines 95800 96272 +472
==========================================
+ Hits 70276 70482 +206
- Misses 24212 24479 +267
+ Partials 1312 1311 -1
Force-pushed from b96d3cd to 3c4204d.
@TomAugspurger: I'm having trouble running my unit tests. My tests used to work, but now I'm noticing infinite loops when running them on a local cluster (the default scheduler). In my last commit, I changed the … Do you have any idea of what's going on? One key difference between my setup now and when I wrote this is that I'm now on an M1 Mac (ARM64). Could this cause my problem?
The … looks incorrect inside of a regular … Can you expand on the desire for …?
dask_options = options.view_as(DaskOptions).get_all_options(
    drop_default=True)
client = ddist.Client(**dask_options)
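For context on the snippet under review: a hedged sketch (not the PR's actual code) of how the full set of pipeline options could be narrowed down to arguments the Dask Client constructor accepts, using reflection along the lines of the later commit "Better method of inspecting the constructor parameters". The helper name and the option names used below are hypothetical.

```python
import inspect

from dask import distributed as ddist


def client_kwargs_from_options(all_options: dict) -> dict:
    """Keep only the keys that match parameters of ddist.Client.__init__."""
    client_params = inspect.signature(ddist.Client.__init__).parameters
    return {k: v for k, v in all_options.items() if k in client_params}


# Hypothetical usage: only 'address' and 'timeout' survive the filtering,
# because they are actual Client constructor parameters.
kwargs = client_kwargs_from_options(
    {'address': None, 'timeout': 30, 'unrelated_beam_option': True})
client = ddist.Client(**kwargs)
```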
How does Beam typically handle the lifetime of runners? In the tests, I see warnings about re-using port 8787 from Dask, since the client (and cluster) aren't being completely cleaned up between tests.
Is it more common for Beam to create (and clean up) the runner? Or would users typically create it?
This is my first runner – @pabloem can probably weigh in better than I can wrt your question. However, what makes sense to me is that each Beam runner should clean up its environment between each run, including in tests.
This probably should happen in the DaskRunnerResult object. Do you have any recommendations on the best way to clean up dask (distributed)?
In a single scope, a context manager handles the cleanup:

with distributed.Client(...) as client:
    ...

But in this case, as you say, you'll need to do the cleanup after the results are done. So I think that something like

client.close()
client.cluster.close()

should do the trick (assuming that Beam is the one managing the lifetime of the client).

If you want to rely on the user having a client active, you can call dask.distributed.get_client(), which will raise a ValueError if one hasn't already been created.
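Following that suggestion, here is a minimal sketch of where the cleanup could live, assuming the runner owns the client. The DaskRunnerResult internals below are illustrative, not the PR's final code.

```python
from dask import distributed


class DaskRunnerResult:
    """Illustrative result object that owns the Dask client it was given."""

    def __init__(self, client: distributed.Client, futures):
        self._client = client
        self._futures = futures

    def wait_until_finish(self, duration=None):
        try:
            # Block until the submitted work completes (or the timeout elapses).
            distributed.wait(self._futures, timeout=duration)
        finally:
            # Close the client and its cluster so repeated test runs don't
            # keep re-using port 8787, as discussed above.
            self._client.close()
            if self._client.cluster is not None:
                self._client.cluster.close()
```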
Yes – thanks for pointing this out. This makes sense to me, looking further at the documentation.
I... really am just trying things to stop hitting an infinite loop. This got me to a timeout error when run in tests. Though, when running e2e in Pangeo-Forge, I definitely experience a runtime error complaining that I wasn't in an …
Interesting! Do the tests pass for you? What is your environment like? I'm concerned that I'm hitting another architecture issue with ARM. Thanks for taking a look at this, Tom.
Force-pushed from 1b6ec0f to 282e2aa.
Run Python PreCommit
Run Python PreCommit
the only test that is giving trouble should be easy to fix or skip for now. I'll review the PR as is and maybe we'll merge it soon
Thanks Pablo. I think I can easily fix it – I'm having trouble reproducing the issue on my local environment due to my M1 woes.
ok I've taken a look. The code, in fact, looks so clean that I'm very happy to merge.
Run Python PreCommit
Run Python PreCommit
Run Python PreCommit
ugggg haha can't get a passing precommit even though the tests are unrelated.
Run Python PreCommit
Run Python PreCommit
sorry about the crazy flakiness. Something is going on recently with our precommits...
ugggg incredibly enough, this issue reproduces only very occasionally in my environment.
Run Python PreCommit
given no changes anywhere close to the current flaky tests, I will merge.
LGTM
Wohoo!
Thanks. Python PreCommit is showing the following test failure: https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/6286/
thanks Yi for pointing this out
* WIP: Created a skeleton dask runner implementation.
* WIP: Idea for a translation evaluator.
* Added overrides and a visitor that translates operations.
* Fixed a dataclass typo.
* Expanded translations.
* Core idea seems to be kinda working...
* First iteration on DaskRunnerResult (keep track of pipeline state).
* Added minimal set of DaskRunner options.
* WIP: Alllmost got asserts to work! The current status is:
  - CoGroupByKey is broken due to how tags are used with GroupByKey
  - GroupByKey should output `[('0', None), ('1', 1)]`, however it actually outputs: [(None, ('1', 1)), (None, ('0', None))]
  - Once that is fixed, we may have test pipelines work on Dask.
* With a great 1-liner from @pabloem, groupby is fixed! Now, all three initial tests pass.
* Self-review: Cleaned up dask runner impl.
* Self-review: Remove TODOs, delete commented out code, other cleanup.
* First pass at linting rules.
* WIP, include dask dependencies + test setup.
* WIP: maybe better dask deps?
* Skip dask tests depending on successful import.
* Fixed setup.py (missing `,`).
* Added an additional comma.
* Moved skipping logic to be above dask import.
* Fix lint issues with dask runner tests.
* Adding destination for client address.
* Changing to async produces a timeout error instead of stuck in infinite loop.
* Close client during `wait_until_finish`; rm async.
* Supporting side-inputs for ParDo.
* Revert "Close client during `wait_until_finish`; rm async." This reverts commit 09365f6.
* Revert "Changing to async produces a timeout error instead of stuck in infinite loop." This reverts commit 676d752.
* Adding -dask tox targets onto the gradle build
* wip - added print stmt.
* wip - prove side inputs is set.
* wip - prove side inputs is set in Pardo.
* wip - rm asserts, add print
* wip - adding named inputs...
* Experiments: non-named side inputs + del `None` in named inputs.
* None --> 'None'
* No default side input.
* Pass along args + kwargs.
* Applied yapf to dask sources.
* Dask sources passing pylint.
* Added dask extra to docs gen tox env.
* Applied yapf from tox.
* Include dask in mypy checks.
* Upgrading mypy support to python 3.8 since py37 support is deprecated in dask.
* Manually installing an old version of dask before 3.7 support was dropped.
* fix lint: line too long.
* Fixed type errors with DaskRunnerResult. Disabled mypy type checking in dask.
* Fix pytype errors (in transform_evaluator).
* Ran isort.
* Ran yapf again.
* Fix imports (one per line)
* isort -- alphabetical.
* Added feature to CHANGES.md.
* ran yapf via tox on linux machine
* Change an import to pass CI.
* Skip isort error; needed to get CI to pass.
* Skip test logic may favor better with isort.
* (Maybe) the last isort fix.
* Tested pipeline options (added one fix).
* Improve formatting of test.
* Self-review: removing side inputs. In addition, adding a more helpful property to the base DaskBagOp (tranform).
* add dask to coverage suite in tox.
* Capture value error in assert.
* Change timeout value to 600 seconds.
* ignoring broken test
* Update CHANGES.md
* Using reflection to test the Dask client constructor.
* Better method of inspecting the constructor parameters (thanks @TomAugspurger!).

Co-authored-by: Pablo E <pabloem@apache.org>
Co-authored-by: Pablo <pabloem@users.noreply.github.com>
Here, I've created a minimum viable Apache Beam runner for Dask. My approach is to visit a Beam Pipeline and translate PCollections into Dask Bags.
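To make the approach concrete, here is a hedged, illustrative sketch of what that translation looks like for a few transforms. The helper names are hypothetical and this is not the PR's actual code.

```python
import dask.bag as db


def translate_create(values):
    # beam.Create(values) roughly corresponds to building a Bag from a sequence.
    return db.from_sequence(values)


def translate_map(bag, fn):
    # A simple beam.Map(fn) roughly corresponds to an element-wise Bag.map.
    return bag.map(fn)


def translate_group_by_key(bag):
    # beam.GroupByKey over (key, value) pairs can be approximated with
    # Bag.groupby on the key, then reshaping each group to (key, [values]).
    return (bag
            .groupby(lambda kv: kv[0])
            .map(lambda group: (group[0], [v for _, v in group[1]])))


if __name__ == '__main__':
    bag = translate_create([('a', 1), ('a', 2), ('b', 3)])
    print(translate_group_by_key(bag).compute())  # e.g. [('a', [1, 2]), ('b', [3])]
```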
In this version, I have supported enough operations to make test pipeline asserts work. The tests themselves are not comprehensive. Further, there are many Bag operations that could be translated for greater efficiency.
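As a usage sketch (assuming the runner ends up importable from apache_beam.runners.dask.dask_runner; the exact module path may differ), running a pipeline on Dask would look roughly like this:

```python
import apache_beam as beam

# Assumed import path for the runner added in this PR; adjust if it differs.
from apache_beam.runners.dask.dask_runner import DaskRunner

with beam.Pipeline(runner=DaskRunner()) as pipeline:
    (pipeline
     | beam.Create([1, 2, 3])
     | beam.Map(lambda x: x * 2)
     | beam.Map(print))
```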
CC: @pabloem
Fixes: #18962
Original PR discussion can be found here: alxmrs#1
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Choose reviewer(s) and mention them in a comment (R: @username).
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.