@avulanov
Contributor

What changes were proposed in this pull request?

Fix the construction of the file path. The previous construction produced an incorrect path on Windows.

How was this patch tested?

Run SQL unit tests on Windows

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61089 has finished for PR 13868 at commit f6eea2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 23, 2016

Can you put in a proper title?

@rxin
Contributor

rxin commented Jun 23, 2016

Also looks like the fix is wrong?

@gatorsmile
Member

cc @yhuai

@srowen
Member

srowen commented Jun 23, 2016

Yes, this is not the change discussed in the JIRA. The best way forward seems to be to replace attempts to make a file: URI manually from a string with use of File.toURI or something from Java 7's Paths to let the JDK do it properly.
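
To make the suggestion concrete, here is a minimal JDK-only sketch. The class and method names are hypothetical and are not taken from the patch. It contrasts prepending "file:" to a string with letting File.toURI or Java 7's Paths build the URI.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Paths;

// Hypothetical demo class; not part of Spark.
final class FileUriDemo {

    // The pattern being replaced: prepend "file:" to a raw path string.
    static URI manual(String path) {
        return URI.create("file:" + path);
    }

    // Let the JDK resolve the path against the working directory and build the URI.
    static URI viaFile(String path) {
        return new File(path).toURI();
    }

    // Java 7 NIO equivalent.
    static URI viaPaths(String path) {
        return Paths.get(path).toUri();
    }

    public static void main(String[] args) {
        String relative = "spark-warehouse";
        System.out.println(manual(relative));   // opaque URI: file:spark-warehouse
        System.out.println(viaFile(relative));  // absolute file: URI under the working directory
        System.out.println(viaPaths(relative)); // absolute file: URI under the working directory
    }
}
```

With a relative path the manual form yields an opaque URI with no path component, and with a Windows path containing backslashes URI.create rejects the string outright, which is the kind of breakage the JDK methods avoid.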

@avulanov avulanov changed the title from "[SPARK-15899] [SQL]" to "[SPARK-15899] [SQL] Fix the construction of the file path with File's toURI" Jun 23, 2016

@avulanov
Contributor Author

@srowen @rxin There was an issue with the string replace: it added an extra slash, which prevented the unit test from succeeding. Fixed this.

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61120 has finished for PR 13868 at commit 40af9b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 24, 2016

This looks like the right direction. There are more instances of file: used this way in the code, but only a few look like they might need a similar treatment. DDLSuite sticks out, and in fact it failed here, so probably needs a similar set of fixes. HiveSparkSubmitSuite might too.

Contributor

I am not sure it is the right fix. What will happen if a user's default fs is hdfs or s3 and he/she set spark.sql.warehouse.dir in the conf?

Contributor Author

@avulanov avulanov Jun 24, 2016

toURI behaves strangely when the path is not local:

scala> new File("hdfs://m.com:9000/data").toURI
res6: java.net.URI = file:/C:/Program%20Files%20(x86)/scala/bin/hdfs:/m.com:9000/data

I think we can add a check and apply toURI only when the string does not start with hdfs: or s3:. Will anyone use hdfs or s3 here, and, if yes, will it actually work? There are more instances of file: in the code, so it seems that developers do not expect that.
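
A rough sketch of such a check, using only the JDK. The helper name and the rule "a one-letter scheme is a Windows drive letter" are assumptions for illustration, not code from this PR. Instead of testing for specific prefixes, it keeps any string that already parses as a URI with a scheme and falls back to File.toURI otherwise.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; the name and the scheme rule are assumptions for illustration.
final class WarehouseUri {

    static URI resolve(String location) {
        try {
            URI parsed = new URI(location);
            String scheme = parsed.getScheme();
            // A one-letter "scheme" is treated as a Windows drive letter (e.g. c:/data).
            if (scheme != null && scheme.length() > 1) {
                return parsed; // already qualified: hdfs://..., s3://..., file:/...
            }
        } catch (URISyntaxException e) {
            // Not parseable as a URI (spaces, backslashes): treat it as a local path.
        }
        return new File(location).toURI();
    }

    public static void main(String[] args) {
        System.out.println(resolve("hdfs://m.com:9000/data")); // hdfs://m.com:9000/data
        System.out.println(resolve("spark-warehouse"));        // a file: URI under the working directory
    }
}
```

Whether a non-local warehouse location works end to end is a separate question that this sketch does not answer.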

@avulanov
Contributor Author

avulanov commented Jun 25, 2016

It seems that the root cause is that Hadoop's Path.toUri adds only / to the beginning of a local path but does not add file:. So, here and there in the code we see in-place concatenations of file: and /. Java's File.toURI adds file:/ to the beginning and / to the end of any path. In particular, Java's File.toURI.getPath equals Hadoop's Path.toUri.toString. Also, some tests did not handle Windows paths with backslashes.

@SparkQA

SparkQA commented Jun 25, 2016

Test build #61214 has finished for PR 13868 at commit a797fbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2016

Test build #61219 has finished for PR 13868 at commit fb50965.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 25, 2016

Hm, where we are conceptually dealing with a local file only, we should use the File API. If some tests are making assertions about the path, really, rather than full URI, then you can get that from getPath. Is that feasible?

I don't observe that File.toURI always appends a slash. It does so for directories, which sounds reasonable as a canonicalization.
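
The directory behavior can be checked with a small JDK-only sketch (the class name is hypothetical). File.toURI appends the slash only when it can determine that the path is an existing directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical demo class; not part of Spark.
final class TrailingSlashDemo {

    // Creates a real directory so that File.isDirectory() is true for it.
    static File newTempDir() {
        try {
            File dir = Files.createTempDirectory("warehouse").toFile();
            dir.deleteOnExit();
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // URI of a directory that exists on disk.
    static String existingDirUri() {
        return newTempDir().toURI().toString();
    }

    // URI of a child path that was never created.
    static String missingChildUri() {
        return new File(newTempDir(), "does-not-exist").toURI().toString();
    }

    public static void main(String[] args) {
        System.out.println(existingDirUri());  // ends with '/'
        System.out.println(missingChildUri()); // no trailing '/'
    }
}
```

A test that compares URI strings can therefore see different results depending on whether the directory has been created yet.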

@SparkQA

SparkQA commented Jun 28, 2016

Test build #61343 has finished for PR 13868 at commit 90a413d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@avulanov
Contributor Author

I used makeQualifiedPath from SessionCatalog to construct paths. It uses Hadoop's Path and FileSystem. Every place that constructs a path should use this approach, to be consistent with the Spark API. There are 36 occurrences of "file:" or "file://" or "file:/" in Spark Scala code as String constants. Each one either checks whether a path is a local file or constructs a path for a local file. The approaches differ. For example:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1717 (uses java.net.URI)
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L611 (uses java.io.File)

The same constant is frequently used in Spark Python.

@SparkQA

SparkQA commented Jun 28, 2016

Test build #61344 has finished for PR 13868 at commit f247423.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@avulanov
Contributor Author

avulanov commented Jul 5, 2016

@srowen Do you suggest fixing all occurrences of file as a constant string, or leaving the patch as is?

Member

As warehousePath will be used like new Path(warehousePath), this will introduce a regression when the path contains spaces or other special characters. E.g.,

scala> new Path("foo bar")
res4: org.apache.hadoop.fs.Path = foo bar

scala> new Path(new Path("foo bar").toUri.toString)
res5: org.apache.hadoop.fs.Path = foo%20bar

@avulanov could you try the following code in Windows and see if it works?

new Path(new Path(getConf(WAREHOUSE_PATH).replace("${system:user.dir}",
       System.getProperty("user.dir"))).toUri).toString

Contributor Author

@avulanov avulanov Jul 5, 2016

It is equivalent to new Path. This does not add a leading slash, and we need to check if it is still needed with the changes we introduced.

scala> new Path(new Path("c:\\foo path").toUri).toString
res1: String = c:/foo path
scala> new Path("c:\\foo path")
res2: org.apache.hadoop.fs.Path = c:/foo path
scala> new Path("c:\\foo path").toUri
res3: java.net.URI = /c:/foo%20path

Member

@zsxwing zsxwing Jul 5, 2016

@avulanov we can check and add the leading slash if missing. Right?

The problem with URI.toString is that it outputs an encoded string, but Path expects the original string.
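
The encoded-versus-original distinction can be reproduced with the JDK's java.net.URI alone. This is a sketch with a hypothetical class name; it does not use Hadoop's Path. The multi-argument URI constructor quotes the space, toString returns the quoted form, and getPath returns the original.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; not part of Spark.
final class EncodedPathDemo {

    // Build a scheme-less URI from a raw path; this constructor quotes illegal characters.
    static URI toUri(String rawPath) {
        try {
            return new URI(null, null, rawPath, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = toUri("/c:/foo path");
        System.out.println(uri.toString()); // /c:/foo%20path
        System.out.println(uri.getPath());  // /c:/foo path
    }
}
```

Feeding the toString form back into something that expects an unencoded path would treat %20 as literal characters, which is the regression described above.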

Contributor

@avulanov Is this comment addressed? It looks to me like a valid concern: doing toUri.toString would introduce encoding, which would mean we can't use warehousePath as new Path(warehousePath) anymore.

Member

Is it better to use URLDecoder.decode() to decode a URL-encoded (e.g. %20) string?
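
One point to weigh here, sketched with the JDK only (the class name is hypothetical): URLDecoder.decode implements HTML form decoding, so it also turns '+' into a space, while URI.getPath only undoes percent-escapes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Spark.
final class DecodeDemo {

    // Form-style decoding: %XX escapes are decoded and '+' becomes a space.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // URI-style decoding: only %XX escapes are decoded.
    static String uriDecode(String s) {
        return URI.create(s).getPath();
    }

    public static void main(String[] args) {
        String encoded = "/data/a+b%20c";
        System.out.println(formDecode(encoded)); // /data/a b c
        System.out.println(uriDecode(encoded));  // /data/a+b c
    }
}
```

For a directory whose name contains a literal '+', the two approaches give different paths, so getPath looks like the safer of the two for file paths.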

@avulanov
Contributor Author

avulanov commented Aug 3, 2016

@shivaram @zsxwing Removed toUri and rebased. 4 new tests from DDLSuite fail on Windows (OK on Linux). I need some more time to fix them.

@vanzin Could you please check that I don't break your change of def warehousePath in SQLConf.scala line 682 from 75a06aa?
cc @srowen

@SparkQA

SparkQA commented Aug 3, 2016

Test build #63167 has finished for PR 13868 at commit c78da11.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2016

Test build #63172 has finished for PR 13868 at commit 2b592b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 val pathInCatalog = new Path(catalog.getDatabaseMetadata("db1").locationUri).toUri
 assert("file" === pathInCatalog.getScheme)
-val expectedPath = if (path.endsWith(File.separator)) path.dropRight(1) else path
+val expectedPath = new Path(path).toUri.toString
Member

Does this toUri need to be removed too?
Yeah I believe there are at least similar changes needed in DDLSuite but should be about the same issue.

@avulanov
Contributor Author

avulanov commented Aug 4, 2016

@srowen @vanzin Thank you. I addressed your comments and fixed the mentioned 4 tests from DDLSuite.scala that failed on Windows.

@avulanov avulanov changed the title from "[SPARK-15899] [SQL] Fix the construction of the file path with File's toURI" to "[SPARK-15899] [SQL] Fix the construction of the file path with hadoop Path" Aug 4, 2016

@SparkQA

SparkQA commented Aug 4, 2016

Test build #63222 has finished for PR 13868 at commit 1b5b035.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Aug 4, 2016

Looks good; I think some toString calls could be cleaned up (e.g. makeQualifiedPath could return a String as far as I can see, don't need toString to use a variable in an interpolated string, etc), but those are minor.

@srowen
Member

srowen commented Aug 6, 2016

Any further comments, watchers? Maybe worth implementing Marcelo's last comments, and then let's merge.

@srowen
Member

srowen commented Aug 9, 2016

@avulanov can you have one more look at Marcelo's last small comments?

@avulanov
Contributor Author

avulanov commented Aug 9, 2016

@srowen Sure. I've addressed @vanzin's comments.

@srowen
Member

srowen commented Aug 9, 2016

OK will merge soonish if there are no further comments. Thanks.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63430 has finished for PR 13868 at commit ea24b59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 11a6844 Aug 10, 2016
asfgit pushed a commit that referenced this pull request Aug 10, 2016
…Path

## What changes were proposed in this pull request?

Fix the construction of the file path. The previous construction produced an incorrect path on Windows.

## How was this patch tested?

Run SQL unit tests on Windows

Author: avulanov <nashb@yandex.ru>

Closes #13868 from avulanov/SPARK-15899-file.

(cherry picked from commit 11a6844)
Signed-off-by: Sean Owen <sowen@cloudera.com>

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
@srowen
Member

srowen commented Aug 10, 2016

Merged to master, and to 2.0 after a weird conflict and then merge problem (Github API isn't matching the status shown here?). I'll keep an eye on the builds to make sure I did it right.

@srowen
Member

srowen commented Aug 10, 2016

This didn't work in branch-2.0. I'm going to investigate briefly and almost certainly revert.

@avulanov
Contributor Author

@srowen Could you give me a link to the log of the failed 2.0 build?

@srowen
Member

srowen commented Aug 10, 2016

It's weird. Some branch 2.0 tests fail in core, but with no visible error anywhere. But I spotted this failure:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.4/538/consoleFull

testCrosstab(test.org.apache.spark.sql.JavaDataFrameSuite)  Time elapsed: 0.098 sec  <<< ERROR!
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.internal.SharedState':
    at test.org.apache.spark.sql.JavaDataFrameSuite.setUp(JavaDataFrameSuite.java:55)
Caused by: java.lang.reflect.InvocationTargetException
    at test.org.apache.spark.sql.JavaDataFrameSuite.setUp(JavaDataFrameSuite.java:55)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:user.dir%7D/spark-warehouse
    at test.org.apache.spark.sql.JavaDataFrameSuite.setUp(JavaDataFrameSuite.java:55)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:user.dir%7D/spark-warehouse
    at test.org.apache.spark.sql.JavaDataFrameSuite.setUp(JavaDataFrameSuite.java:55)

I thought I must have screwed up the conflict resolution, but I can't see an error on visually evaluating the two:

Master: 11a6844
2.0: 719ac5f

This was one key conflict:
719ac5f#diff-32bb9518401c0948c5ea19377b5069abR694

and the other was just in a test in DDLSuite that didn't exist in 2.0, it seems.

Not sure what to make of it. Thanks for looking if you're able to evaluate this. I will revert in the meantime.

@avulanov
Contributor Author

@srowen It seems that the issue is with the new version of warehousePath. In the original 2.0, warehousePath replaced user.dir in the path with a string replace. In master, warehousePath was simplified because the new getConf does the replace. The new 2.0 commit 719ac5f at your link has the old getConf and the new warehousePath, so the placeholder is never expanded. In particular, the following occurs:

scala> val x = new Path("${system:x.d}")
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:x.d%7D

There are at least two ways to fix this: either use warehousePath with string replace or new version(s) of getConf. The former seems simpler:

def warehousePath: String = {
  new Path(getConf(WAREHOUSE_PATH).replace("${system:user.dir}",
    System.getProperty("user.dir"))).toString
}

It was in one of my commits (fb12118) until I bumped into 75a06aa.
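
For readers following along, here is a JDK-only sketch of why the unexpanded placeholder fails and what the string replace does. The class and method names are hypothetical. The assumption, labeled in the code, is that the text before the first ':' ends up as the URI scheme and the rest as the path; the "Relative path in absolute URI" message itself comes from java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; not part of Spark.
final class UserDirSubstitution {

    static final String PLACEHOLDER = "${system:user.dir}";

    // Expand the placeholder before the value is parsed as a path.
    static String expand(String configured) {
        return configured.replace(PLACEHOLDER, System.getProperty("user.dir"));
    }

    // Assumption: the text before the first ':' is taken as the scheme and the
    // rest as the path, then both go to the multi-argument URI constructor.
    // Returns the failure reason, or null if the URI could be built.
    static String failureReason(String configured) {
        int colon = configured.indexOf(':');
        if (colon < 0) {
            return null; // no scheme candidate, nothing to demonstrate
        }
        String scheme = configured.substring(0, colon);
        String path = configured.substring(colon + 1);
        try {
            new URI(scheme, null, path, null, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        String raw = "${system:user.dir}/spark-warehouse";
        System.out.println(failureReason(raw)); // Relative path in absolute URI
        System.out.println(expand(raw));        // <working dir>/spark-warehouse
    }
}
```

With the replace applied first, the string no longer contains the "${system:" prefix that gets mistaken for a scheme.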

@srowen
Member

srowen commented Aug 10, 2016

OK, if you're willing to go one more round and prepare a patch for 2.0, it would be much appreciated. You're close to the change and I know several people hit this with 2.0.0. I will revert my handiwork in the branch for now.

@avulanov
Contributor Author

@srowen OK, I think I can do that. Do I understand correctly that I need to clone 2.0, make the necessary changes, and then make a new PR?

@srowen
Member

srowen commented Aug 10, 2016

Yeah, you'd branch from branch-2.0, cherry-pick your original PR, resolve the conflicts as needed, and then open a new PR.

asfgit pushed a commit that referenced this pull request Aug 11, 2016
…Path for Spark 2.0

This PR contains the adaptation of #13868 for Spark 2.0

## What changes were proposed in this pull request?

Fix the construction of the file path in `SQLConf.scala` and the unit tests that rely on it: `SQLConfSuite` and `DDLSuite`. The previous construction produced an incorrect path on Windows.

## How was this patch tested?

Run unit tests on Windows

Author: avulanov <nashb@yandex.ru>

Closes #14600 from avulanov/SPARK-15899-file-2.0.
 .doc("The default location for managed databases and tables.")
 .stringConf
-.createWithDefault("file:${system:user.dir}/spark-warehouse")
+.createWithDefault("${system:user.dir}/spark-warehouse")
Member

@avulanov can I call on your expertise here? @koertkuipers and I noticed that this causes a problem, in that this path intends to be a local file system path in the local home dir, but will now be interpreted as a path on HDFS for HDFS deployments.

If this is intended to be a local path always, and it seems like it is, then the usages of the new makeQualifiedPath are a bit wrong in that they explicitly resolve the path against the Hadoop file system, which can be HDFS.

Alternatively, just removing user.dir kind of works too, in that it will at least become a path relative to the HDFS user dir I think. Do you know which is better?

Contributor

or use FileSystem.getHomeDirectory?

Member

Yes that would resolve it, probably, if the intent is to let this become a directory on HDFS. I think it was supposed to be a local file so maybe we have to find a Windows-friendly way to bring back the file: prefix.

Maybe make the default value just "spark-warehouse" and then below in def warehousePath, add logic to resolve this explicitly against the LocalFilesystem? I'll give that a shot soon if nobody has better ideas.

Contributor

If the goal is to have a local directory here, always, you could add this to the config constant:

.transform(new File(_).toURI().toString())

Which should be Windows-compatible, right?

Member

I've filed https://issues.apache.org/jira/browse/SPARK-17810 and am about to open a PR for the fix I proposed. I think you're right, though then I wonder, what if I set the value to "/my/local/path"? it still will get interpreted later as an HDFS path, when as I understand it's always supposed to be treated as a local path.

Contributor

.transform also applies to user-provided values, so "/my/local/path" would become "file:/my/local/path" or something.
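
As a rough, JDK-only illustration of what such a transform would do to both the default and a user-provided value: the class below is a hypothetical stand-in, not Spark's config builder, and the exact output depends on the platform and working directory.

```java
import java.io.File;
import java.util.function.Function;

// Hypothetical stand-in for the suggested config transform; not Spark's API.
final class LocalWarehouseTransform {

    // Interprets every value as a local path and qualifies it as a file: URI.
    static final Function<String, String> TO_FILE_URI =
        value -> new File(value).toURI().toString();

    public static void main(String[] args) {
        // A relative default is resolved against the JVM's working directory (user.dir).
        System.out.println(TO_FILE_URI.apply("spark-warehouse"));
        // An absolute user-provided value gets the file: scheme prepended.
        System.out.println(TO_FILE_URI.apply("/my/local/path"));
    }
}
```

Because File treats any input as a local path, a value such as hdfs://m.com:9000/data would also be rewritten into a file: URI under the working directory, as the REPL output earlier in this thread shows.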

Contributor Author

Yes, we can use File.toURI() for WAREHOUSE_PATH, if we are sure that it is always a local path. However, I remember someone in this thread mentioned that the path might be an Amazon S3 path. Is this supposed to happen?
