[SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements #12398
Conversation
|
So looking back on the predecessor PR it seemed like @davies suggested adding support for specifying these at runtime with |
|
I think having this supported could be very useful :) |
python/pyspark/context.py
Outdated
    import pip
    for req in pip.req.parse_requirements(path, session=uuid.uuid1()):
        if not req.check_if_exists():
            pip.main(['install', req.req.__str__()])
So it seems that this can sometimes require elevated privileges, based on the issues with the previous Jenkins run. What if at startup we created a fixed temp directory per context, added it to our path with sys.path.insert(0, self.pipBase), and at install did something along the lines of:
pip.main(['install', req.req.__str__(), '--target', self.pipBase]) so that we don't need write permissions to the default pip target?
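A minimal sketch of that per-context install directory idea, assuming an older pip release that still exposes pip.main and pip.req (as the code in this PR already does); the pip_base variable and install_requirements helper are hypothetical names used only for illustration:

```python
import sys
import tempfile
import uuid

import pip  # assumes an older pip where pip.main and pip.req are importable

# Create one scratch install directory per context and make it importable.
pip_base = tempfile.mkdtemp(prefix="spark-pip-")
sys.path.insert(0, pip_base)


def install_requirements(path):
    # Install each missing requirement into pip_base via --target, so no
    # write access to the default site-packages location is needed.
    for req in pip.req.parse_requirements(path, session=uuid.uuid1()):
        if not req.check_if_exists():
            pip.main(['install', str(req.req), '--target', pip_base])
```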
|
Since @andrewor14 got Jenkins to run the earlier iteration of this PR, maybe you could do the same here? |
|
I could see the pipBase approach working. The other testing blocker I have is that the context that gets created in the test is able to use local packages: the test package that is created can be imported on the workers without distributing it via addPyPackage. https://github.com/apache/spark/pull/12398/files#diff-388a156f4ce454fe98d7a99a0f7f0012R1950 Is there a way to get the mocked workers to not use the master's Python path? The pip part is nice, but I think the meat of this PR is in addPyPackage. I looked into adding a flag to the CLI, but kind of went down a rabbit hole trying to figure it out. I'd advocate adding it to the CLI, but I think someone working in the Spark core space could get that integrated much more efficiently than I could. |
|
ok to test |
|
Test build #2967 has finished for PR 12398 at commit
|
python/pyspark/tests.py
Outdated
        self.assertSequenceEqual([0, 3, 6, 9], trips.collect())
    finally:
        shutil.rmtree(name)
|
|
remove the extra empty line
|
@buckhx These APIs seem useful. Could you also add an argument for bin/spark-submit (only for the requirements file)? |
|
Added an addPyPackage example to the docstring and fixed that test formatting. We're looking into adding the spark-submit arg for --py-requirements. The pip API takes a file path via pip.req.parse_requirements https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L64 but it looks like it might be possible to take a string and break it out line by line with a mock file name via pip.req.process_line and pip.req.preprocess https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L110 |
|
@buckhx Even if the pip API only takes a file path, we could write the requirements out as a temporary file, I think. |
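A rough sketch of that temporary-file idea, reusing the same older pip API already called in this PR (pip.req.parse_requirements with a session argument); parse_requirements_string is a hypothetical helper name:

```python
import tempfile
import uuid

import pip  # assumes an older pip that exposes pip.req.parse_requirements


def parse_requirements_string(requirements):
    # Write the requirement strings to a temporary file so that
    # pip.req.parse_requirements, which only accepts a file path, can read them.
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write('\n'.join(requirements))
        path = f.name
    return list(pip.req.parse_requirements(path, session=uuid.uuid1()))
```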
add py-requirements submit option
|
@robert3005 added the --py-requirements option to spark-submit. I'm currently looking at passing the requirements as a list or a line-delimited string. |
|
Changed addRequirementsFile to addPyRequirements, which takes a list of requirement strings. |
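For reference, a short sketch of how the renamed method might be called, based on the description above; the app name and the requirement pins are illustrative, not taken from the PR:

```python
from pyspark import SparkContext

sc = SparkContext(appName="py-requirements-example")

# Proposed API in this PR: addPyRequirements takes a list of requirement
# strings rather than a path to a requirements file.
sc.addPyRequirements(['requests==2.9.1', 'six>=1.10.0'])
```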
|
@davies how's this looking? |
|
Test build #3068 has finished for PR 12398 at commit
|
|
@buckhx any interest in updating this against master? |
|
@buckhx Just following up to see whether this is something you are still interested in working on. |
|
ping @buckhx |
## What changes were proposed in this pull request?

This PR proposes to close stale PRs. What I mean by "stale" here is that there are review comments from reviewers, but the author has looked inactive, without any answer to them, for more than a month. I left some comments roughly a week ago to ping the authors, and they still look inactive on the PRs below.

The list below includes some PRs suggested to be closed and a PR against another branch, which seems obviously inappropriate. Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.

Closes apache#7963
Closes apache#8374
Closes apache#11192
Closes apache#11374
Closes apache#11692
Closes apache#12243
Closes apache#12583
Closes apache#12620
Closes apache#12675
Closes apache#12697
Closes apache#12800
Closes apache#13715
Closes apache#14266
Closes apache#15053
Closes apache#15159
Closes apache#15209
Closes apache#15264
Closes apache#15267
Closes apache#15871
Closes apache#15861
Closes apache#16319
Closes apache#16324
Closes apache#16890
Closes apache#12398
Closes apache#12933
Closes apache#14517

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16937 from HyukjinKwon/stale-prs-close.
What changes were proposed in this pull request?
Context.addPyPackage()
Context.addPyRequirements()
Both of these methods take a package on the master and ship it to the workers when called instead of having to manually install packages on all workers.
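A hedged usage sketch of the proposed addPyPackage call, assuming it accepts an already-imported package object as suggested by this description; the package name, app name, and exact signature are assumptions and may differ from the PR's final implementation:

```python
from pyspark import SparkContext

import mypackage  # hypothetical package that is importable on the master/driver

sc = SparkContext(appName="add-py-package-example")

# Proposed API in this PR: ship the package from the master to the workers at
# runtime so its modules can be used inside RDD operations without manually
# installing the package on every worker first.
sc.addPyPackage(mypackage)

rdd = sc.parallelize(range(4)).map(lambda x: mypackage.__name__)
print(rdd.collect())
```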
How was this patch tested?
Unit tests are written, but I do not believe they accurately reflect a distributed environment. The test_add_py_package test passes even without using addPyPackage. The addRequirementsFile method requires internet access to reach the global PyPI server, so it won't work on the current Jenkins build system.
We have had this patch running at Palantir for about a year in production.