Conversation

@loneknightpy
Contributor

What changes were proposed in this pull request?

This PR makes the spark-submit script download remote files to the local file system in local/standalone client mode.

How was this patch tested?

  • Unit tests
  • Manual tests by adding the s3a jar and testing against a file on S3.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@vanzin
Contributor

vanzin commented May 24, 2017

Please reference the existing bug (SPARK-10643) instead.

@loneknightpy loneknightpy changed the title [SPARK-20860] Make spark-submit download remote files to local in client mode [SPARK-10643] Make spark-submit download remote files to local in client mode May 24, 2017
@SparkQA

SparkQA commented May 24, 2017

Test build #77268 has finished for PR 18078 at commit 4cdeeed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 24, 2017

Test build #77271 has finished for PR 18078 at commit 6e86290.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

jiangxb1987 left a comment

Thank you for working on this! The PR looks good overall; it would be great if more negative test cases could be added, such as invalid file paths.

RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
}

// In client mode, download remotes files.
Contributor

nit: "remotes" -> "remote"

test("resolves command line argument paths correctly") {
  val jars = "/jar1,/jar2" // --jars
- val files = "hdfs:/file1,file2" // --files
+ val files = "local:/file1,file2" // --files
Contributor

Could you expand on why we are changing this?

Contributor Author

To keep the test from trying to download the file from HDFS.

Contributor

It is kind of difficult to test downloading a file from HDFS right now, but we should cover this scenario in the future.
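To illustrate why the fixture uses `local:/file1`: whether a path triggers a download comes down to its URI scheme. Here is a minimal sketch of that check using only `java.net.URI`; the helper name `isRemote` is hypothetical, not Spark's actual API.

```scala
import java.net.URI

// Hypothetical helper: would this path trigger a download in client mode?
// "local:" and "file:" URIs (and scheme-less paths) already live on the
// local file system; hdfs:, s3a:, http:, etc. point at remote storage.
def isRemote(path: String): Boolean =
  Option(new URI(path).getScheme).getOrElse("file") match {
    case "file" | "local" => false
    case _                => true
  }
```

Under a check like this, `local:/file1` resolves without any network access, whereas `hdfs:/file1` would make the test reach out to HDFS.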

@jiangxb1987
Contributor

Could you also add "[Core]" tag in the title? @loneknightpy
Also cc @cloud-fan @gatorsmile

@loneknightpy loneknightpy changed the title [SPARK-10643] Make spark-submit download remote files to local in client mode [SPARK-10643] [Core] Make spark-submit download remote files to local in client mode May 25, 2017
@SparkQA

SparkQA commented May 25, 2017

Test build #77388 has started for PR 18078 at commit e5171ca.

@jiangxb1987
Contributor

LGTM

@SparkQA

SparkQA commented May 25, 2017

Test build #77389 has started for PR 18078 at commit 62e57df.

}

// In client mode, download remote files.
if (deployMode == CLIENT) {
Contributor

Sorry, I may not have enough background knowledge: why do we only do this in client mode?

Contributor Author

It seems Spark can already handle remote files in YARN/Mesos cluster mode. I haven't tested it, because we are using client mode.

/**
* Download a list of remote files to temp local files. If the file is local, the original file
* will be returned.
* @param fileList A comma separated file list, it cannot be null.
Member

Nit: no need to add the comment "it cannot be null". Just add an assert at the beginning of the function to ensure it.
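A sketch of the shape this suggestion leads to: the null contract becomes a runtime check at the top of the function instead of a doc comment. This is an illustration, not Spark's implementation; the download step is stubbed out as a function parameter so the sketch stays self-contained.

```scala
// Hypothetical sketch: split the comma-separated list, check non-null up
// front, and pass already-local entries through unchanged.
def downloadFileList(fileList: String, download: String => String): String = {
  require(fileList != null, "fileList cannot be null")
  fileList.split(",").map { path =>
    val isLocal = path.startsWith("local:") || path.startsWith("file:") ||
      !path.contains(":")
    if (isLocal) path else download(path)
  }.mkString(",")
}
```

With `require`, a null argument fails fast with an `IllegalArgumentException` rather than a `NullPointerException` deep inside the split.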

}

/**
* Download remote file to a temporary local file. If the file is local, the original file
Member

How about:

Downloads a file from the remote location to a local temporary directory. If the input path points to a local path, returns it with no operation.

.fromPath(tmpFile.getAbsolutePath)
.scheme("file")
.build()
.toString
Member

        val localPath = new Path(tmpFile.getAbsolutePath)
        fs.copyToLocalFile(new Path(uri), localPath)

        val localFS: FileSystem = localPath.getFileSystem(hadoopConf)
        localFS.makeQualified(localPath).toString

Does this work?

Member

It sounds like our code base never calls UriBuilder directly.

Contributor Author

@loneknightpy May 26, 2017

If UriBuilder is a concern, we can just use "file:${tmpFile.getAbsolutePath}"

@SparkQA

SparkQA commented May 26, 2017

Test build #77412 has started for PR 18078 at commit 7eb5d1a.

@SparkQA

SparkQA commented May 26, 2017

Test build #77413 has started for PR 18078 at commit 3f18b81.

printStream.println(s"Downloading ${uri.toString} to ${tmpFile.getAbsolutePath}.")
// scalastyle:on println
fs.copyToLocalFile(new Path(uri), new Path(tmpFile.getAbsolutePath))
s"file:${tmpFile.getAbsolutePath}"
Member

Or calling Utils.resolveURI(tmpFile.getAbsolutePath).toString?

It sounds like Utils.resolveURI is commonly used for this purpose?
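For a simple absolute path, the spellings being debated here coincide. The sketch below compares the raw interpolation with a URI built through the standard library; `java.io.File.toURI` stands in for Spark's `Utils.resolveURI` (an assumption for illustration, since the latter needs the Spark codebase), and the temp path is hypothetical.

```scala
import java.io.File

// Hypothetical temp-file path, standing in for tmpFile.getAbsolutePath.
val tmpPath = "/tmp/spark-staging/app.jar"

val rawForm = s"file:$tmpPath"                 // plain string interpolation
val uriForm = new File(tmpPath).toURI.toString // qualified via java.net.URI
```

Both yield `file:/tmp/spark-staging/app.jar` for this input; the advantage of going through a URI helper is that it also handles relative paths and inputs that already carry a scheme, which raw interpolation does not.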

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented May 26, 2017

Test build #77418 has finished for PR 18078 at commit 3f18b81.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented May 26, 2017

Test build #77427 has finished for PR 18078 at commit 2d6f2cd.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@loneknightpy
Contributor Author

retest this please

@SparkQA

SparkQA commented May 26, 2017

Test build #77434 has finished for PR 18078 at commit 2d6f2cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 26, 2017

Test build #77435 has finished for PR 18078 at commit 2d6f2cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 26, 2017
…in client mode

## What changes were proposed in this pull request?

This PR makes spark-submit script download remote files to local file system for local/standalone client mode.

## How was this patch tested?

- Unit tests
- Manual tests by adding s3a jar and testing against file on s3.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Yu Peng <loneknightpy@gmail.com>

Closes #18078 from loneknightpy/download-jar-in-spark-submit.

(cherry picked from commit 4af3781)
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
@asfgit asfgit closed this in 4af3781 May 26, 2017
@gatorsmile
Member

gatorsmile commented May 26, 2017

Thanks! Merging to master/2.2
