Conversation

@buckhx commented Apr 14, 2016

What changes were proposed in this pull request?

Context.addPyPackage()
Context.addPyRequirements()

Both methods take a package on the master and ship it to the workers when called, instead of requiring packages to be installed manually on every worker.
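
A rough usage sketch of addPyPackage, assuming it is called with a module already imported on the driver (the package name here is hypothetical, not from the patch):

```python
from pyspark import SparkContext

sc = SparkContext(appName="add-py-package-demo")

# my_package is importable on the driver but not installed on the workers
# (hypothetical package name).
import my_package

# Ship the package to the workers so tasks can import it, instead of
# installing it manually on every node.
sc.addPyPackage(my_package)
```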

How was this patch tested?

Unit tests are written, but I do not believe they accurately reflect a distributed environment. The test_add_py_package test passes even without using addPyPackage. The addRequirementsFile method requires internet access to reach the global PyPI server, so it won't work on the current Jenkins build system.

We have had this patch running at Palantir for about a year in production.

@holdenk (Contributor) commented Apr 14, 2016

Looking back at the predecessor PR, @davies suggested adding support for specifying these at runtime with spark-submit, and it seems the pip install permissions issue might still be present for testing.

@holdenk (Contributor) commented Apr 14, 2016

I think having this supported could be very useful :)

# path points at a requirements file; assumes pip < 10, where pip.req and pip.main are importable
import uuid
import pip

for req in pip.req.parse_requirements(path, session=uuid.uuid1()):
    if not req.check_if_exists():
        pip.main(['install', str(req.req)])
A contributor commented on the diff:

So it seems that this can sometimes require elevated privileges, based on the issues with the previous Jenkins run. What if at startup we created a fixed temp directory per context, added it to our path with sys.path.insert(0, self.pipBase), and at install time did something along the lines of pip.main(['install', req.req.__str__(), '--target', self.pipBase]) so that we don't need write permissions to the default pip target?
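
A minimal sketch of that idea, assuming the pre-pip-10 API (pip.main, pip.req) used above; the class and attribute names are placeholders rather than anything in this patch:

```python
import sys
import tempfile

import pip  # assumes pip < 10, where pip.main and pip.req are importable


class PerContextPipTarget(object):
    """Install requirements into a per-context directory instead of the
    default site-packages, so no elevated privileges are needed."""

    def __init__(self):
        # Fixed temp directory created once per context and prepended to the path.
        self.pipBase = tempfile.mkdtemp(prefix="spark-pip-")
        sys.path.insert(0, self.pipBase)

    def install(self, req):
        # Install into pipBase rather than the default pip target.
        pip.main(['install', str(req.req), '--target', self.pipBase])
```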

@holdenk (Contributor) commented Apr 14, 2016

Since @andrewor14 got Jenkins to run the earlier iteration of this PR maybe you could do the same here?

@buckhx (Author) commented Apr 18, 2016

I could see the pipBase approach working.

The other testing blocker is that the context created in the test seems to be able to use local packages: the test package can be imported on the workers without distributing it via addPyPackage. https://github.com/apache/spark/pull/12398/files#diff-388a156f4ce454fe98d7a99a0f7f0012R1950

Is there a way to get the mocked workers to not use the master's Python path?

The pip part is nice, but I think the meat of this PR is in the addPyPackage.

I looked into adding a flag to the CLI, but went down a bit of a rabbit hole trying to figure it out. I'd advocate adding it to the CLI, but I think someone working in the Spark core space could get it integrated much more efficiently than I could.

@andrewor14 (Contributor) commented May 4, 2016

ok to test

@davies

@SparkQA commented May 4, 2016

Test build #2967 has finished for PR 12398 at commit ce9966e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    self.assertSequenceEqual([0, 3, 6, 9], trips.collect())
finally:
    shutil.rmtree(name)

A contributor commented on the diff:

remove the extra empty line

@davies (Contributor) commented May 4, 2016

@buckhx These APIs seem useful. Could you also add an argument to bin/spark-submit (only for the requirements file)?

@buckhx (Author) commented May 9, 2016

Added an addPyPackage example to the docstring and fixed the test formatting. We're looking into adding a --py-requirements argument for spark-submit.

The pip API takes a file path via pip.req.parse_requirements (https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L64), but it looks like it might be possible to take a string and break it out line by line with a mock file name via pip.req.process_line and pip.req.preprocess (https://github.com/pypa/pip/blob/develop/pip/req/req_file.py#L110).

@davies (Contributor) commented May 9, 2016

@buckhx Even if the pip API only takes a file path, we could write the requirements to a temporary file, I think.
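
A minimal sketch of that approach, again assuming the pre-pip-10 pip.req API; the helper name is made up for illustration:

```python
import tempfile
import uuid

import pip  # assumes pip < 10


def parse_requirement_strings(requirements):
    """Write requirement strings to a temporary file so that
    pip.req.parse_requirements can consume them as a requirements file."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write('\n'.join(requirements))
        path = f.name
    # parse_requirements reads lazily, so materialize the list before the
    # temporary file is cleaned up.
    return list(pip.req.parse_requirements(path, session=uuid.uuid1()))
```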

Robert Kruszewski and others added 2 commits May 11, 2016 00:34
@buckhx (Author) commented May 11, 2016

@robert3005 added --py-requirements to spark-submit. I'm currently looking at passing the requirements as a list or a line-delimited string.

@buckhx buckhx changed the title [SPARK-5929][PYSPARK] Context addPyPackage and addRequirementsFile [SPARK-5929][PYSPARK] Context addPyPackage and addPyRequirements May 13, 2016
@buckhx (Author) commented May 13, 2016

Changed addRequirementsFile to addPyRequirements, which takes a list of requirement strings.
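
With that change, a call on an existing SparkContext sc might look like this (the package names and pins are purely illustrative):

```python
# Requirements are passed as a list of requirement strings rather than
# as a path to a requirements file.
sc.addPyRequirements(["numpy==1.11.0", "pandas>=0.18", "requests"])
```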

@buckhx (Author) commented Jun 6, 2016

@davies how's this looking?

@SparkQA commented Jun 6, 2016

Test build #3068 has finished for PR 12398 at commit 9c37e06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Sep 7, 2016

@buckhx any interest in updating this against master?

@holdenk (Contributor) commented Oct 7, 2016

@buckhx Just following up to see if this is something you are still interested in working on?

@HyukjinKwon (Member) commented:

ping @buckhx

@asfgit closed this in ed338f7 on Feb 17, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
## What changes were proposed in this pull request?

This PR proposes to close stale PRs.

What I mean by "stale" here is that there are review comments from reviewers, but the author has looked inactive, without any answer to them, for more than a month.

I left some comments roughly a week ago to ping the authors, and they still look inactive in the PRs below.

The list below includes some PRs suggested to be closed and a PR against another branch, which seems obviously inappropriate.

Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.

Closes apache#7963
Closes apache#8374
Closes apache#11192
Closes apache#11374
Closes apache#11692
Closes apache#12243
Closes apache#12583
Closes apache#12620
Closes apache#12675
Closes apache#12697
Closes apache#12800
Closes apache#13715
Closes apache#14266
Closes apache#15053
Closes apache#15159
Closes apache#15209
Closes apache#15264
Closes apache#15267
Closes apache#15871
Closes apache#15861
Closes apache#16319
Closes apache#16324
Closes apache#16890

Closes apache#12398
Closes apache#12933
Closes apache#14517

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16937 from HyukjinKwon/stale-prs-close.