-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15899] [SQL] Fix the construction of the file path with hadoop Path #13868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #61089 has finished for PR 13868 at commit
|
|
Can you put in a proper title? |
|
Also looks like the fix is wrong? |
|
cc @yhuai |
|
Yes, this is not the change discussed in the JIRA. The best way forward seems to be to replace attempts to make a |
|
Test build #61120 has finished for PR 13868 at commit
|
|
This looks like the right direction. There are more instances of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it is the right fix. What will happen if a user's default fs is hdfs or s3 and he/she set spark.sql.warehouse.dir in the conf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toURI behaves strange when the path is not local:
scala> new File("hdfs://m.com:9000/data").toURI
res6: java.net.URI = file:/C:/Program%20Files%20(x86)/scala/bin/hdfs:/m.com:9000
/data
I think we can add a check and apply toURI when the string does not start from hdfs: or s3:. Will anyone use hdfs or s3 here, and, if yes, will it actually work? There are more instances of file: in code, so it seems that developers do not expect that.
|
It seems that the root cause is that hadoop |
|
Test build #61214 has finished for PR 13868 at commit
|
|
Test build #61219 has finished for PR 13868 at commit
|
|
Hm, where we are conceptually dealing with a local file only, we should use the File API. If some tests are making assertions about the path, really, rather than full URI, then you can get that from getPath. Is that feasible? I don't observe that File.getURI appends a slash always. It does so for directories, which sounds reasonable as a canonicalization. |
|
Test build #61343 has finished for PR 13868 at commit
|
|
I used The same constant is frequently used in Spark Python. |
|
Test build #61344 has finished for PR 13868 at commit
|
|
@srowen Do you suggest to fix all occurrences of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As warehousePath will be used like new Path(warehousePath), this will introduce a regression when the path contains spaces or other special characters. E.g.,
scala> new Path("foo bar")
res4: org.apache.hadoop.fs.Path = foo bar
scala> new Path(new Path("foo bar").toUri.toString)
res5: org.apache.hadoop.fs.Path = foo%20bar
@avulanov could you try the following code in Windows and see if it works?
new Path(new Path(getConf(WAREHOUSE_PATH).replace("${system:user.dir}",
System.getProperty("user.dir"))).toUri).toString
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is equivalent to new Path. This does not add a leading slash, and we need to check if it is still needed with the changes we introduced.
scala> new Path(new Path("c:\\foo path").toUri).toString
res1: String = c:/foo path
scala> new Path("c:\\foo path")
res2: org.apache.hadoop.fs.Path = c:/foo path
scala> new Path("c:\\foo path").toUri
res3: java.net.URI = /c:/foo%20path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avulanov we can check and add the leading slash if missing. Right?
The problem of URI.toString is that it outputs an encoded string but Path expects the original string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avulanov Is this comment addressed ? It looks to me that its a valid concern that doing toUri.toString would introduce encoding that will mean we cant use warehousePath as new Path(warehousePath) anymore ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is better to use URLDecoder.decode() to decode a URL encoding (e.g. %20) string?
c78da11 to
2b592b6
Compare
|
Test build #63167 has finished for PR 13868 at commit
|
|
Test build #63172 has finished for PR 13868 at commit
|
| val pathInCatalog = new Path(catalog.getDatabaseMetadata("db1").locationUri).toUri | ||
| assert("file" === pathInCatalog.getScheme) | ||
| val expectedPath = if (path.endsWith(File.separator)) path.dropRight(1) else path | ||
| val expectedPath = new Path(path).toUri.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this toUri need to be removed too?
Yeah I believe there are at least similar changes needed in DDLSuite but should be about the same issue.
|
Test build #63222 has finished for PR 13868 at commit
|
|
Looks good; I think some |
|
Any further comments, watchers? maybe worth implementing Marcelo's last comments and then let's merge. |
|
@avulanov can you have one more look at Marcelo's last small comments? |
|
OK will merge soonish if there are no further comments. Thanks. |
|
Test build #63430 has finished for PR 13868 at commit
|
…Path ## What changes were proposed in this pull request? Fix the construction of the file path. Previous way of construction caused the creation of incorrect path on Windows. ## How was this patch tested? Run SQL unit tests on Windows Author: avulanov <nashb@yandex.ru> Closes #13868 from avulanov/SPARK-15899-file. (cherry picked from commit 11a6844) Signed-off-by: Sean Owen <sowen@cloudera.com> # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala # sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
|
Merged to master, and to 2.0 after a weird conflict and then merge problem (Github API isn't matching the status shown here?). I'll keep an eye on the builds to make sure I did it right. |
|
This didn't work in branch-2.0. I'm going to investigate briefly and almost certainly revert. |
|
@srowen Could you give me a link to the log of the failed 2.0 build? |
|
It's weird. Some branch 2.0 tests fail in core, but with no visible error anywhere. But I spotted this failure: I thought I must have screwed up the conflict resolution, but I can't see an error on visually evaluating the two: This was one key conflict: and the other was just in a test in DDLSuite that didn't exist in 2.0, it seems. Not sure what to make of it. Thanks for looking if you're able to evaluate this. I will revert in the meantime. |
|
@srowen It seems that the issue is with the new version of There are at least two ways to fix this: either use It was in one of my commits fb12118 until I bumped into 75a06aa |
|
OK, if you're willing to go one more round and prepare a patch for 2.0, it would be much appreciated. You're close to the change and I know several people hit this with 2.0.0. I will revert my handiwork in the branch for now. |
|
@srowen OK, I think I can do that. Do I understand correctly, that I need to clone 2.0, make the necessary changes and then make a new PR? |
|
Yeah, you'd branch from branch-2.0, cherry-pick your original PR, resolve the conflicts as needed, and then open a new PR. |
…Path for Spark 2.0 This PR contains the adaptation of #13868 for Spark 2.0 ## What changes were proposed in this pull request? Fix the construction of the file path in `SQLConf.scala` and unit tests that rely on this: `SQLConfSuite` and `DDLSuite`. Previous way of construction caused the creation of incorrect path on Windows. ## How was this patch tested? Run unit tests on Windows Author: avulanov <nashb@yandex.ru> Closes #14600 from avulanov/SPARK-15899-file-2.0.
| .doc("The default location for managed databases and tables.") | ||
| .stringConf | ||
| .createWithDefault("file:${system:user.dir}/spark-warehouse") | ||
| .createWithDefault("${system:user.dir}/spark-warehouse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avulanov can I call on your expertise here? @koertkuipers and I noticed that this causes a problem, in that this path intends to be a local file system path in the local home dir, but will now be interpreted as a path on HDFS for HDFS deployments.
If this is intended to be a local path always, and it seems like it is, then the usages of the new makeQualifiedPath are a bit wrong in that they explicitly resolve the path against the Hadoop file system, which can be HDFS.
Alternatively, just removing user.dir kind of works too, in that it will at least become a path relative to the HDFS user dir I think. Do you know which is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or use FileSystem.getHomeDirectory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would resolve it, probably, if the intent is to let this become a directory on HDFS. I think it was supposed to be a local file so maybe we have to find a Windows-friendly way to bring back the file: prefix.
Maybe make the default value just "spark-warehouse" and then below in def warehousePath, add logic to resolve this explicitly against the LocalFilesystem? I'll give that a shot soon if nobody has better ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the goal is to have a local directory here, always, you could add this to the config constant:
.transform(new File(_).toURI().toString())
Which should be Windows-compatible, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've filed https://issues.apache.org/jira/browse/SPARK-17810 and am about to open a PR for the fix I proposed. I think you're right, though then I wonder, what if I set the value to "/my/local/path"? it still will get interpreted later as an HDFS path, when as I understand it's always supposed to be treated as a local path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.transform also applies to user-provided values, so "/my/local/path" would become "file:/my/local/path" or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can use File.toURI() for WAREHOUSE_PATH, if we are sure that it is always a local path. However, I remember someone in this thread mentioned that the path might be an amazon s3 path. Is this supposed to happen?
What changes were proposed in this pull request?
Fix the construction of the file path. Previous way of construction caused the creation of incorrect path on Windows.
How was this patch tested?
Run SQL unit tests on Windows