bpo-29842: Make Executor.map less eager so it handles large/unbounded input iterables appropriately #707

Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:
Thanks again for your contribution, and we look forward to reviewing it!
@MojoVampire, thanks for your PR! By analyzing the history of the files in this pull request, we identified @birkenfeld, @brianquinlan and @ezio-melotti to be potential reviewers.
I've already submitted a contributor agreement pre-GitHub migration. I just updated my b.p.o. user profile (josh.r) to link to my GitHub account name. Is that sufficient, or do I need to submit a new contributor agreement based on my GitHub e-mail address?
Hmm... Is the failure of continuous-integration/travis-ci/pr something real? Clicking Details just tells me it can't find a python/cpython repository at all...
You can also take a look at my implementation that I uploaded to https://github.com/pkch/executors. It does something more like what I described in the issue tracker, the main benefit being that it's not blocking. |
LGTM at code level.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
@MojoVampire could you share your plans about this PR? Do you plan to drive it forward?
I have made the requested changes; please review again. I did not add a Misc/NEWS entry since the file no longer exists (it's autogenerated from commit messages now, correct?).
Thanks for making the requested changes! @methane: please review the changes made to this pull request.
NEWS entries can be generated using blurb or blurb-it. Please see: https://devguide.python.org/committing/?highlight=news#what-s-new-and-news-entries
I have made the requested changes; please review again. Actually made the Misc/NEWS entry properly. Sorry for the confusion; I haven't made a PR since the new Misc/NEWS regime began and didn't know about the blurb tool. Thanks for the assist @tirkarthi
Thanks for making the requested changes! @methane: please review the changes made to this pull request.
Some comments below. A significant issue is that this changes the behaviour of shutdown(wait=True) to not wait for completion of all pending futures. I don't think that's an acceptable change.
By default, a reasonable number of tasks are queued beyond the number of workers; an explicit *prefetch* count may be provided to specify how many extra tasks should be queued.
Using "chunks" here would be more precise than "tasks".
The documentation for chunksize uses the phrasing "this method chops iterables into a number of chunks which it submits to the pool as separate tasks", and since not all executors even use chunks (ThreadPoolExecutor ignores the argument), I figured I'd stick with "tasks". It does kind of leave out a term to describe a single work item; the docs use chunks and tasks as synonyms, with no term for a single work item.
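For reference, the chunksize behaviour described above is easy to check: ThreadPoolExecutor accepts the argument purely for signature compatibility with ProcessPoolExecutor and ignores it, so the results are identical regardless of the value passed. A minimal sketch:

```python
from concurrent.futures import ThreadPoolExecutor

# ThreadPoolExecutor accepts chunksize for API compatibility with
# ProcessPoolExecutor, but the value has no effect for threads.
with ThreadPoolExecutor(max_workers=2) as ex:
    out = list(ex.map(pow, [2, 3, 4], [2, 2, 2], chunksize=100))

print(out)  # [4, 9, 16]
```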
self.assertCountEqual(finished, range(10))
# No guarantees on how many tasks dispatched,
# but at least one should have been dispatched
self.assertGreater(len(finished), 0)
I think this change breaks compatibility. The doc for shutdown says:
If wait is True then this method will not return until all the pending futures are done executing and the resources associated with the executor have been freed.
So all futures should have executed, instead of being cancelled.
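The documented behaviour is straightforward to confirm with the stdlib as it stands: futures that have been submitted run to completion before shutdown(wait=True) returns. A small check:

```python
import time
from concurrent.futures import ThreadPoolExecutor

done = []

def work(i):
    time.sleep(0.05)
    done.append(i)  # list.append is atomic, so this is thread-safe

ex = ThreadPoolExecutor(max_workers=2)
for i in range(6):
    ex.submit(work, i)
ex.shutdown(wait=True)  # blocks until all six tasks have executed

print(sorted(done))  # [0, 1, 2, 3, 4, 5]
```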
At the time I wrote it, it didn't conflict with the documentation precisely; the original documentation said that map was "Equivalent to map(func, *iterables) except func is executed asynchronously and several calls to func may be made concurrently.", but doesn't guarantee that any actual futures exist (it's implemented in terms of submit and futures, but doesn't actually require such a design).
That said, it looks like you updated the documentation to add "the iterables are collected immediately rather than lazily;", which, if considered a guarantee, rather than a warning, would make this a breaking change even ignoring the "cancel vs. wait" issue.
Do you have any suggestions? If strict adherence to your newly (as of late 2017) documented behavior is needed, I suppose I could change the default behavior from "reasonable prefetch" to "exhaustive prefetch", so when prefetch isn't passed, every task is submitted, but it would be kind of annoying to lose the "good by default" behavior of limited prefetching.
The reason I cancelled rather than waiting on the result is that I was trying to follow the normal use pattern for map; since the results are yielded lazily, if the iterator goes away or is closed explicitly (or you explicitly shut down the executor), you're done; having the outstanding futures complete when you're not able to see the results means you're either:
- Expecting the tasks to complete without running out the Executor.map (which doesn't work with Py3's map at all, so the analogy to map should allow it; if you don't run it out, you have no guarantees anything was done)
- Not planning to use any further results (in which case running any submitted but unscheduled futures means doing work no one can see the results of)
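The analogy with the built-in map can be seen directly: it consumes inputs one at a time, and discarding the iterator abandons whatever work remains, which is the usage pattern the cancel-on-shutdown behaviour was meant to mirror:

```python
calls = []

def double(x):
    calls.append(x)
    return x * 2

m = map(double, range(1, 11))
print(calls)           # [] — nothing has run yet; map is lazy
print(next(m), calls)  # 2 [1] — exactly one input consumed
del m                  # dropping the iterator abandons the rest
print(calls)           # still [1]; the other nine calls never happen
```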
Actually, I think there are two problems to discuss:
1. What happens when `shutdown(wait=True)` is called. Currently it waits for all outstanding tasks. I don't think we can change that (the explicit `wait` flag exists for a reason).
2. Whether `map()` can be silently switched to a lazy mode of operation. There's a (perhaps minor) problem with that. Currently, if one of the iterables raises an error, `map()` propagates the exception. With your proposal, the exception may be raised later, inside the result iterator.

I think 2) might easily be worked around by introducing a separate method (`lazy_map`?).
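The timing difference in the second point can be demonstrated against the current, eager map(): an exception raised by the input iterable escapes from the map() call itself, not from the result iterator. A sketch of the eager case:

```python
from concurrent.futures import ThreadPoolExecutor

def bad_inputs():
    yield 1
    raise ValueError("boom")

with ThreadPoolExecutor(max_workers=1) as ex:
    try:
        # The stdlib's eager map() consumes the whole input iterable
        # before returning, so the ValueError surfaces right here...
        results = ex.map(abs, bad_inputs())
        where = "inside the result iterator"
    except ValueError:
        where = "at the map() call"

# ...whereas a lazy map would defer it to next(results).
print(where)  # at the map() call
```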
It seems it would be good to discuss those questions on the mailing-list.
Yeah, the problem with using the "lazy_map" name is that it feels like recreating the same annoying distinctions between map and imap from the Py2 era, and it would actually have Executor.map (which is supposed to match map, which lazily consumes the input(s)) be less similar to map than Executor.lazy_map.
If it's necessary to gain acceptance, I could change the default behavior to use prefetch=sys.maxsize - self._max_workers. It would match the pre-existing behavior for just about anything that conceivably worked before (modulo the tiny differences in memory usage of deque vs. list for storing the futures) since:
- All tasks would be submitted fully up front, so shutdown(wait=True) would in fact wait on them (and no further calls to submit would occur in the generator, so submitting wouldn't occur post-shutdown, which would raise a RuntimeError and cause the cancellation)
- It wouldn't be lazy for anything by default (it would either work eagerly or crash, in the same manner it currently behaves)
If you passed a reasonable prefetch, you wouldn't have these behaviors (and we should document that interaction), but at least existing code would continue to work identically.
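The bounded-prefetch scheme under discussion can be sketched on top of the public submit() API. Everything here is illustrative rather than the PR's actual code: the lazy_map name, the prefetch default, and the use of the private _max_workers attribute are all assumptions made for the sketch.

```python
import collections
import itertools
from concurrent.futures import ThreadPoolExecutor

def lazy_map(executor, fn, *iterables, prefetch=2):
    # Keep at most max_workers + prefetch futures outstanding.
    # _max_workers is a private attribute; used here only for the sketch.
    argsiter = zip(*iterables)
    initial = executor._max_workers + prefetch
    fs = collections.deque(executor.submit(fn, *args)
                           for args in itertools.islice(argsiter, initial))

    def result_iterator():
        while fs:
            res = fs.popleft().result()  # block on the oldest future
            # Top the window back up with one more task, if any remain.
            for args in itertools.islice(argsiter, 1):
                fs.append(executor.submit(fn, *args))
            yield res

    return result_iterator()

with ThreadPoolExecutor(max_workers=2) as ex:
    out = list(lazy_map(ex, lambda x: x + 1, range(5), prefetch=1))

print(out)  # [1, 2, 3, 4, 5]
```

Because results are popped in submission order, output order matches the input, like the stdlib map().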
I don't have a strong opinion. I think discussing those alternatives on the ML, to gather more opinions and arguments, would be useful.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
…lds no reference to result at the moment it yields
Reduce line lengths to PEP8 limits
@MojoVampire I think you need to comment your PR with the phrase "I have made the requested changes; please review again".
This PR is stale because it has been open for 30 days with no activity.
@MojoVampire, are you going to follow up this PR?
@erlend-aasland: I'd like to apply this, but I never got any idea of what would constitute an acceptable final result. Executor.map is, frankly, useless for many of its intended purposes right now; in an effort to improve performance on huge inputs, you end up prefetching the entire input and pre-scheduling all the tasks before you can process any of them.
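The eager pre-scheduling being complained about is easy to reproduce: asking the current map() for a single result still consumes the entire input iterable up front. A quick experiment:

```python
from concurrent.futures import ThreadPoolExecutor

consumed = []

def inputs():
    for i in range(1000):
        consumed.append(i)
        yield i

with ThreadPoolExecutor(max_workers=2) as ex:
    results = ex.map(str, inputs())
    first = next(results)  # ask for just the first result...

# ...yet every input was consumed and scheduled before it arrived.
print(first, len(consumed))  # 0 1000
```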
cc. @kumaraditya303, is this something you'd be interested in reviewing?
I'll take a look soon, but it seems there are two PRs for the same thing and this one has conflicts, so maybe we should continue on #18566.
Closing this in favor of #18566, thanks for the PR!
https://bugs.python.org/issue29842