-
Notifications
You must be signed in to change notification settings - Fork 332
[PAUSED - DO NOT REVIEW] Allow for quotes in path for blob storage #1586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I originally used Hadoop So I'm not strong on doing this either way but given there's unanimous pushback on it, I've implemented the normalization manually in the latest push. @eric-maynard - I thought about creating new |
| } | ||
| } | ||
|
|
||
| private static String scheme(String location) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: method name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to getScheme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this get missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - seems like things didn't push :( Will push with the next set of comments.
| String path; | ||
| if (scheme == null) { | ||
| path = location; | ||
| scheme = "file"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have an enum or something for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not, we are parsing them from the user-provided input. I tried hard-typing these and rejecting any schemes that are not explicitly supported (but including file formats that don't explicitly state a scheme) and then a barrage of test cases started failing due to weird input we've put in - some of them intentionally, others not. So I'm not really in favor of making such a massive shift, it will dramatically increase the scope of this change.
polaris-core/src/main/java/org/apache/polaris/core/storage/StorageLocation.java
Show resolved
Hide resolved
| path = location.substring(scheme.length() + 1); | ||
| } | ||
| path = normalizePath(path); | ||
| if (scheme.equals("file")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, let's at least put this in a const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger that, that's reasonable. Added in next revision.
| if (scheme.equals("file")) { | ||
| this.location = URI.create(LOCAL_PATH_PREFIX + path).toString(); | ||
| } else { | ||
| this.location = scheme + "://" + path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this duplicate the ://?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - the scheme will be without the "://", as we parse right up to the ":" for it. The path is normalized in the normalizePath function such that we will remove any starting slashes and the parsing for path starts after the ":" character.
| } else { | ||
| this.location = new Path(location).toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this branch used to cover relative paths, but now it won't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. Trying to deal with these is only going to muddy the waters further here if we try to implement it manually :( And I don't think this is a case that we really need to care heavily for cloud storage providers given that they are really just pseudo-filesystems.
If this is a strong requirement, I strongly recommend that we go back to Hadoop's Path class instead of trying this manually....
|
|
||
| import org.apache.polaris.core.storage.StorageLocation; | ||
| import org.assertj.core.api.Assertions; | ||
| import org.junit.jupiter.api.Assertions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not assertj? They are generally more user-friendly, IMHO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Earlier, I was testing some assertionThrows and JUnit was much easier to work with for those imo. But now that those are gone, I agree for these cases - I'll change them back.
| Assertions.assertEquals(testStorageLocation, baseStorageLocation); | ||
|
|
||
| testStorageLocation = StorageLocation.of("s3://////path'/////*to===///\"file"); | ||
| Assertions.assertEquals("s3://path'/*to===/\"file", testStorageLocation.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in this URI path' is not a "path" but "authority" per RFC 3986
| public void testGcpLocations() { | ||
| StorageLocation baseStorageLocation = StorageLocation.of("gcs://path/to/file"); | ||
| StorageLocation testStorageLocation = StorageLocation.of("gcs://path/to/////file"); | ||
| Assertions.assertEquals(testStorageLocation, baseStorageLocation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this test materially different from testAwsLocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In spirit, no. Let me merge this case and the AWS cases.
| public void testAwsLocations() { | ||
| StorageLocation baseStorageLocation = StorageLocation.of("s3://path/to/file"); | ||
| StorageLocation testStorageLocation = StorageLocation.of("s3://path/to/////file"); | ||
| Assertions.assertEquals(testStorageLocation, baseStorageLocation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is good, but I believe more test cases are needed to validate the handling of upper / lower case letters and leading / trailing slashes and the special cases of missing scheme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call out - let me add them. Missing scheme is in the function below with this variable: slashLeadingLocation
|
I mostly meant that this PR should not claim that it I did not mean to expand the scope of this PR. We can merge this incremental improvement, re-test the issue manually and follow-up with more PRs and integration tests as required. |
| if (schemePos > 0) { | ||
| return location.substring(0, schemePos).toLowerCase(Locale.ROOT); | ||
| } | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - this is passed early in the making of the StorageLocation and can, as a result, get filesystem locations like /var/sample/location. This is how we're figuring out that this is a local file path.
| return null; | ||
| } | ||
|
|
||
| protected String normalizePath(String path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't really normalization, maybe it's canonicalization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like both terms are synonyms - at least I'm not seeing a difference between these two terms. Can you please explain the difference or provide a link to something that does? For reference, this is what Google is pulling up on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canonicalization usually means adding all the missing parts, such as the scheme and authority, whereas normalization / resolution usually just means making a path absolute.
| } | ||
|
|
||
| protected String normalizePath(String path) { | ||
| path = SLASHES.matcher(path).replaceAll("/"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is always correct. IIRC on Azure /foo/bar is not the same as /foo//bar. Do you know how S3 treats this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is just a special case for file:///... which should really just be treated as an empty authority
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes, great catch. Can't believe I forgot about this but you're correct, S3 is the same as well. I will make changes to make sure this is only happening for local files then.
| if (path.endsWith("/") || path.startsWith("/")) { | ||
| path = | ||
| path.substring( | ||
| path.startsWith("/") && path.length() > 1 ? 1 : 0, | ||
| path.endsWith("/") ? path.length() - 1 : path.length()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringUtils.strip
| testStorageLocation = StorageLocation.of("gcs://////path'/////*to===///\"file"); | ||
| Assertions.assertEquals("gcs://path'/*to===/\"file", testStorageLocation.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above -- is this actually correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it work right now in Polaris?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not correct. Actually when looking through Java URI code - we are not currently applying any normalization to URLs...
And I'm not sure that normalizing URLs is entirely necessary either. We've worked through the comments above that double slashes shouldn't be modified and similarly, "." and ".." also shouldn't be normalized for blob storage.
As a result, I think this might be trying to do too much. I'm going to stop work on this and push a new branch that simply just takes the URI create off for the Blob Storage locations and adds additional tests. Will link that PR here when I push it.
This is a retry at #1586, which I'll close if this PR is on the right direction instead. Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in #1545.
|
Closing in favor of #1604 |
This is a retry at apache#1586, which I'll close if this PR is on the right direction instead. Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545.
* fix: Remove duplicated code in IcebergCatalog (apache#1681) * Fix SparkClient listGenericTable to use ListGenericTablesRESTResponse ListTableResponse is the class used by iceberg endpoint, which is the same as ListGenericTablesRESTResponse. However, we are suppose to use ListGenericTablesRESTResponse to be correct * fix: Remove info log about deprecated internal method from PolarisConfiguration (apache#1672) Remove the INFO log about calls to this method in Polaris, because users cannot do anything about these messages. Phasing out old property names requires coordination with users (e.g. release notes), so it is not a matter of merely avoiding calls to that method in Polaris code. Fixes apache#1666 * Create a single binary distribution bundle (apache#1589) * fix(quickstart): Correct Quickstart Instructions (apache#1673) * Remove Java URI validations for Blob Storage providers (apache#1604) This is a retry at apache#1586, which I'll close if this PR is on the right direction instead. Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545. * Improve test coverage for invalid inputs in Policy APIs (apache#1665) * Fix getting-started docker start by PR apache#1532 (apache#1687) * Fix the manual test broken by PR apache#1532 (apache#1688) * Fix credentials printing twice (apache#1682) * main: Update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.0.4 (apache#1690) * INFO: last merged commit 493de03 --------- Co-authored-by: Alexandre Dutra <adutra@users.noreply.github.com> Co-authored-by: gh-yzou <167037035+gh-yzou@users.noreply.github.com> Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com> Co-authored-by: Yufei Gu <yufei@apache.org> Co-authored-by: Adnan Hemani <adnan.h@berkeley.edu> Co-authored-by: William Hyun <william@apache.org> Co-authored-by: Christopher Lambert <xn137@gmx.de> Co-authored-by: Mend Renovate <bot@renovateapp.com>
Fixes #1545. Further context on that ticket.
Java URIs do not allow special characters, such as quotes, as part of RFC 3986. As a result, it is not possible to use paths that contain such characters with Java URIs, which we use to compare storage locations when creating a new catalog.
This change refactors that code to only use URI if the location provided is a local filesystem path - else, it will use Hadoop's
Pathclass in order to normalize the location.