Skip to content

Conversation

@adnanhemani
Copy link
Contributor

Fixes #1545. Further context on that ticket.

Java URIs do not allow special characters, such as quotes, as part of RFC 3986. As a result, it is not possible to use paths that contain such characters with Java URIs, which we use to compare storage locations when creating a new catalog.

This change refactors that code to only use URI if the location provided is a local filesystem path - else, it will use Hadoop's Path class in order to normalize the location.

@adnanhemani
Copy link
Contributor Author

I originally used Hadoop Path class because it is the closest thing to what we want and we already bring the Hadoop Client API dependency in other modules, such as Service. I don't personally think adding a client API to this module is the worst thing in the world, but not adding it just makes us to have to normalize the paths ourselves.

So I'm not strong on doing this either way but given there's unanimous pushback on it, I've implemented the normalization manually in the latest push.

@eric-maynard - I thought about creating new XYZLocation types, but I found them to just end up being very simplistic classes after drafting the changes. S3 and GCS don't have the same level of complexity that Azure has, so I've implemented the normalization in StorageLocation itself. Please let me know if this is something you feel strongly about. Thanks all!

}
}

private static String scheme(String location) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: method name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to getScheme

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this get missed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - seems like things didn't push :( Will push with the next set of comments.

String path;
if (scheme == null) {
path = location;
scheme = "file";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have an enum or something for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not, we are parsing them from the user-provided input. I tried hard-typing these and rejecting any schemes that are not explicitly supported (but including file formats that don't explicitly state a scheme) and then a barrage of test cases started failing due to weird input we've put in - some of them intentionally, others not. So I'm not really in favor of making such a massive shift, it will dramatically increase the scope of this change.

path = location.substring(scheme.length() + 1);
}
path = normalizePath(path);
if (scheme.equals("file")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, let's at least put this in a const

Copy link
Contributor Author

@adnanhemani adnanhemani May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that, that's reasonable. Added in next revision.

if (scheme.equals("file")) {
this.location = URI.create(LOCAL_PATH_PREFIX + path).toString();
} else {
this.location = scheme + "://" + path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this duplicate the ://?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - the scheme will be without the "://", as we parse right up to the ":" for it. The path is normalized in the normalizePath function such that we will remove any starting slashes and the parsing for path starts after the ":" character.

Comment on lines 54 to 53
} else {
this.location = new Path(location).toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this branch used to cover relative paths, but now it won't

Copy link
Contributor Author

@adnanhemani adnanhemani May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Trying to deal with these is only going to muddy the waters further here if we try to implement it manually :( And I don't think this is a case that we really need to care heavily for cloud storage providers given that they are really just pseudo-filesystems.

If this is a strong requirement, I strongly recommend that we go back to Hadoop's Path class instead of trying this manually....


import org.apache.polaris.core.storage.StorageLocation;
import org.assertj.core.api.Assertions;
import org.junit.jupiter.api.Assertions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not assertj? They are generally more user-friendly, IMHO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, I was testing some assertionThrows and JUnit was much easier to work with for those imo. But now that those are gone, I agree for these cases - I'll change them back.

Assertions.assertEquals(testStorageLocation, baseStorageLocation);

testStorageLocation = StorageLocation.of("s3://////path'/////*to===///\"file");
Assertions.assertEquals("s3://path'/*to===/\"file", testStorageLocation.toString());
Copy link
Contributor

@dimas-b dimas-b May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in this URI path' is not a "path" but "authority" per RFC 3986

public void testGcpLocations() {
StorageLocation baseStorageLocation = StorageLocation.of("gcs://path/to/file");
StorageLocation testStorageLocation = StorageLocation.of("gcs://path/to/////file");
Assertions.assertEquals(testStorageLocation, baseStorageLocation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test materially different from testAwsLocations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spirit, no. Let me merge this case and the AWS cases.

public void testAwsLocations() {
StorageLocation baseStorageLocation = StorageLocation.of("s3://path/to/file");
StorageLocation testStorageLocation = StorageLocation.of("s3://path/to/////file");
Assertions.assertEquals(testStorageLocation, baseStorageLocation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is good, but I believe more test cases are needed to validate the handling of upper / lower case letters and leading / trailing slashes and the special cases of missing scheme.

Copy link
Contributor Author

@adnanhemani adnanhemani May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out - let me add them. Missing scheme is in the function below with this variable: slashLeadingLocation

@dimas-b
Copy link
Contributor

dimas-b commented May 15, 2025

Re: #1545 : this PR fixes part of the problem, but I'm not sure it fixes all that needs to be fixed. I believe the use case in #1545 needs to be tested end-to-end (with reading from tables) to be sure it is fixed.

@adnanhemani
Copy link
Contributor Author

I believe the use case in #1545 needs to be tested end-to-end (with reading from tables) to be sure it is fixed.

@dimas-b - I think that's a reasonable ask. Are you saying that we should do this manually once or that this needs an integration test?

@dimas-b
Copy link
Contributor

dimas-b commented May 16, 2025

I mostly meant that this PR should not claim that it Fixes #1545 (in the description) :)

I did not mean to expand the scope of this PR. We can merge this incremental improvement, re-test the issue manually and follow-up with more PRs and integration tests as required.

if (schemePos > 0) {
return location.substring(0, schemePos).toLowerCase(Locale.ROOT);
}
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - this is passed early in the making of the StorageLocation and can, as a result, get filesystem locations like /var/sample/location. This is how we're figuring out that this is a local file path.

return null;
}

protected String normalizePath(String path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really normalization, maybe it's canonicalization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like both terms are synonyms - at least I'm not seeing a difference between these two terms. Can you please explain the difference or provide a link to something that does? For reference, this is what Google is pulling up on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonicalization usually means adding all the missing parts, such as the scheme and authority, whereas normalization / resolution usually just means making a path absolute.

}

protected String normalizePath(String path) {
path = SLASHES.matcher(path).replaceAll("/");
Copy link
Contributor

@eric-maynard eric-maynard May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is always correct. IIRC on Azure /foo/bar is not the same as /foo//bar. Do you know how S3 treats this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is just a special case for file:///... which should really just be treated as an empty authority

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes, great catch. Can't believe I forgot about this but you're correct, S3 is the same as well. I will make changes to make sure this is only happening for local files then.

Comment on lines +125 to +130
if (path.endsWith("/") || path.startsWith("/")) {
path =
path.substring(
path.startsWith("/") && path.length() > 1 ? 1 : 0,
path.endsWith("/") ? path.length() - 1 : path.length());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringUtils.strip

Comment on lines +47 to +48
testStorageLocation = StorageLocation.of("gcs://////path'/////*to===///\"file");
Assertions.assertEquals("gcs://path'/*to===/\"file", testStorageLocation.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above -- is this actually correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work right now in Polaris?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not correct. Actually when looking through Java URI code - we are not currently applying any normalization to URLs...

And I'm not sure that normalizing URLs is entirely necessary either. We've worked through the comments above that double slashes shouldn't be modified and similarly, "." and ".." also shouldn't be normalized for blob storage.

As a result, I think this might be trying to do too much. I'm going to stop work on this and push a new branch that simply just takes the URI create off for the Blob Storage locations and adds additional tests. Will link that PR here when I push it.

@adnanhemani
Copy link
Contributor Author

Created #1604. Will keep this PR open until #1604 merges or at least gets confirmed as the right way to go - but please do not review this PR until further notice.

@adnanhemani adnanhemani changed the title Allow for quotes in path for blob storage [PAUSED] Allow for quotes in path for blob storage May 19, 2025
@adnanhemani adnanhemani changed the title [PAUSED] Allow for quotes in path for blob storage [PAUSED - DO NOT REVIEW] Allow for quotes in path for blob storage May 19, 2025
@snazy snazy marked this pull request as draft May 21, 2025 08:56
eric-maynard pushed a commit that referenced this pull request May 27, 2025
This is a retry at #1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in #1545.
@eric-maynard
Copy link
Contributor

Closing in favor of #1604

@github-project-automation github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board May 27, 2025
williamhyun pushed a commit to williamhyun/polaris that referenced this pull request May 27, 2025
This is a retry at apache#1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545.
snazy added a commit to snazy/polaris that referenced this pull request Jun 13, 2025
* fix: Remove duplicated code in IcebergCatalog (apache#1681)

* Fix SparkClient listGenericTable to use ListGenericTablesRESTResponse

ListTableResponse is the class used by iceberg endpoint, which is the same as ListGenericTablesRESTResponse. However, we are suppose to use ListGenericTablesRESTResponse to be correct

* fix: Remove info log about deprecated internal method from PolarisConfiguration (apache#1672)


Remove the INFO log about calls to this method in Polaris, because users cannot do anything about these messages.

Phasing out old property names requires coordination with users (e.g. release notes), so it is not a matter of merely avoiding calls to that method in Polaris code.

Fixes apache#1666

* Create a single binary distribution bundle (apache#1589)

* fix(quickstart): Correct Quickstart Instructions (apache#1673)

* Remove Java URI validations for Blob Storage providers (apache#1604)

This is a retry at apache#1586, which I'll close if this PR is on the right direction instead.

Java URI does not actually apply any normalization to URIs if we do not call URI.normalize() (which we currently do not). Additionally, blob storage providers like S3 and GCS can provide ".." and "." as valid fragments in URLs - which Java URI would attempt to normalize incorrectly. As a result, attempting to validate and/or normalize URIs for blob storage providers using the Java URI class is the incorrect behavior. While we may want to add location validation via regex later, removing it first should at least unblock the bug we see in apache#1545.

* Improve test coverage for invalid inputs in Policy APIs (apache#1665)

* Fix getting-started docker start by PR apache#1532 (apache#1687)

* Fix the manual test broken by PR apache#1532 (apache#1688)

* Fix credentials printing twice (apache#1682)

* main: Update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.0.4 (apache#1690)

* INFO: last merged commit 493de03

---------

Co-authored-by: Alexandre Dutra <adutra@users.noreply.github.com>
Co-authored-by: gh-yzou <167037035+gh-yzou@users.noreply.github.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Yufei Gu <yufei@apache.org>
Co-authored-by: Adnan Hemani <adnan.h@berkeley.edu>
Co-authored-by: William Hyun <william@apache.org>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Mend Renovate <bot@renovateapp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to use AWS S3 locations with quotes in path

5 participants