Skip to content

Conversation

@srowen
Copy link
Member

@srowen srowen commented Nov 26, 2014

Simply, add data/ to distributions. This adds about 291KB (compressed) to the tarball, FYI.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23894 has started for PR 3480 at commit 47688f1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 26, 2014

Test build #23894 has finished for PR 3480 at commit 47688f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23894/
Test PASSed.

@pwendell
Copy link
Contributor

Hey sean - any reason not to put these in src/main/resources within the examples module? This is what spark sql does and it seems like a better model. That location is included in the dist already.

@srowen
Copy link
Member Author

srowen commented Nov 28, 2014

Generally yes I'd put resources in src/main/resources. These aren't in that location, and the examples are written to expect them under data/ at the root level. I suppose putting them in src/main/resources causes them to be built into the .jar file, and they're not tiny. That is, they're not actually used as resources in this way. So I left them where they are but just put them into the tarball in the same place the source tree and examples expect them.

@pwendell
Copy link
Contributor

Hey Sean - I don't quite understand. The only use of the examples project is to produce the assembly jar for use in distributions, so it seems legitimate to include them as resources for that project. Putting them in the jar would not increase the total size of the distribution, it would just relocate them to being inside of the jar.

The examples jar is not used outside of this context, so embedding more data in there doesn't matter, from what I can tell. We actually removed examples from the set of published jars for 1.2. It was sort of weird that we were publishing it since there is no public API in there and just standalone programs.

@srowen
Copy link
Member Author

srowen commented Nov 30, 2014

@pwendell Sure, I agree that the size isn't a big deal, and there's not really a case where you would use the distribution without the examples .jar. Forget the size issue.

Really it is that none of the code would read the data files from src/main/resources, nor would any of the examples as given in their documentation. They all refer to data/, which is where the files are in the source tree and in the source distribution. This makes the binary distribution consistent with those, and with all the documented examples.

If there's interest, of course I can make another PR to move these files and update all the docs to look for examples/src/main/resources/... Heck you could even write the examples to look for the data files within the .jar instead.

@pwendell
Copy link
Contributor

pwendell commented Dec 1, 2014

Oh I see - yeah I meant we'd also re-write the examples to correctly load example data from the classpath. If something is in src/main/resources, you can just look for the resource using java's resource API. However, maybe this makes the examples too confusing. I could see someone getting tripped up on loading a file from a classpath resource (a fairly advanced concept). Also if the examples do sc.textFile(/path/to/example) or something, that won't work. So maybe my suggestion is at odds with having simple, easy to understand, examples.

@mengxr
Copy link
Contributor

mengxr commented Dec 1, 2014

@pwendell The example data do not need to be on the classpath. They are sample data files used by mllib examples, e.g., BinaryClassification, MovieLensALS. Usually the example code is the starting point for users. @srowen 's change makes it easy to run exmaples:

  1. download and unzip the distribution zip
  2. run bin/run-example mllib.DatasetExample, which will read a file under data/ by default.

The change looks good to me.

asfgit pushed a commit that referenced this pull request Dec 1, 2014
Simply, add data/ to distributions. This adds about 291KB (compressed) to the tarball, FYI.

Author: Sean Owen <sowen@cloudera.com>

Closes #3480 from srowen/SPARK-2192 and squashes the following commits:

47688f1 [Sean Owen] Add data/ to distributions

(cherry picked from commit 6384f42)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in 6384f42 Dec 1, 2014
@mengxr
Copy link
Contributor

mengxr commented Dec 2, 2014

Merged into master and branch-1.2. Thanks!

@srowen srowen deleted the SPARK-2192 branch December 3, 2014 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants