Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,15 @@

import jakarta.annotation.Nonnull;
import java.net.URI;
import java.util.Locale;
import java.util.regex.Pattern;
import org.apache.polaris.core.storage.azure.AzureLocation;

/** An abstraction over a storage location */
public class StorageLocation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, we need tests for the new case being supported

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call out. Included in the latest commit!

private final String location;
public static final String LOCAL_PATH_PREFIX = "file:///";
public static final Pattern SLASHES = Pattern.compile("/+");

/** Create a StorageLocation from a String path */
public static StorageLocation of(String location) {
Expand All @@ -40,12 +43,21 @@ public static StorageLocation of(String location) {
protected StorageLocation(@Nonnull String location) {
if (location == null) {
this.location = null;
} else if (location.startsWith("file:/") && !location.startsWith(LOCAL_PATH_PREFIX)) {
this.location = URI.create(location.replaceFirst("file:/+", LOCAL_PATH_PREFIX)).toString();
} else if (location.startsWith("/")) {
this.location = URI.create(location.replaceFirst("/+", LOCAL_PATH_PREFIX)).toString();
return;
}
String scheme = scheme(location);
String path;
if (scheme == null) {
path = location;
scheme = "file";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have an enum or something for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not, we are parsing them from the user-provided input. I tried hard-typing these and rejecting any schemes that are not explicitly supported (but including file formats that don't explicitly state a scheme) and then a barrage of test cases started failing due to weird input we've put in - some of them intentionally, others not. So I'm not really in favor of making such a massive shift, it will dramatically increase the scope of this change.

} else {
path = location.substring(scheme.length() + 1);
}
path = normalizePath(path);
if (scheme.equals("file")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, let's at least put this in a const

Copy link
Contributor Author

@adnanhemani adnanhemani May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that, that's reasonable. Added in next revision.

this.location = URI.create(LOCAL_PATH_PREFIX + path).toString();
} else {
this.location = URI.create(location).toString();
this.location = scheme + "://" + path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this duplicate the ://?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - the scheme will be without the "://", as we parse right up to the ":" for it. The path is normalized in the normalizePath function such that we will remove any starting slashes and the parsing for path starts after the ":" character.

}
}

Expand Down Expand Up @@ -99,4 +111,23 @@ public boolean isChildOf(StorageLocation potentialParent) {
return slashTerminatedLocation.startsWith(slashTerminatedParentLocation);
}
}

private static String scheme(String location) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: method name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to getScheme

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this get missed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - seems like things didn't push :( Will push with the next set of comments.

int schemePos = location.indexOf(':');
if (schemePos > 0) {
return location.substring(0, schemePos).toLowerCase(Locale.ROOT);
}
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - this is passed early in the making of the StorageLocation and can, as a result, get filesystem locations like /var/sample/location. This is how we're figuring out that this is a local file path.

}

protected String normalizePath(String path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really normalization, maybe it's canonicalization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like both terms are synonyms - at least I'm not seeing a difference between these two terms. Can you please explain the difference or provide a link to something that does? For reference, this is what Google is pulling up on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonicalization usually means adding all the missing parts, such as the scheme and authority, whereas normalization / resolution usually just means making a path absolute.

path = SLASHES.matcher(path).replaceAll("/");
Copy link
Contributor

@eric-maynard eric-maynard May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is always correct. IIRC on Azure /foo/bar is not the same as /foo//bar. Do you know how S3 treats this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is just a special case for file:///... which should really just be treated as an empty authority

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes, great catch. Can't believe I forgot about this but you're correct, S3 is the same as well. I will make changes to make sure this is only happening for local files then.

if (path.endsWith("/") || path.startsWith("/")) {
path =
path.substring(
path.startsWith("/") && path.length() > 1 ? 1 : 0,
path.endsWith("/") ? path.length() - 1 : path.length());
}
Comment on lines +125 to +130
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringUtils.strip

return path;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -19,17 +19,41 @@
package org.apache.polaris.service.storage;

import org.apache.polaris.core.storage.StorageLocation;
import org.assertj.core.api.Assertions;
import org.junit.jupiter.api.Assertions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not assertj? They are generally more user-friendly, IMHO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, I was testing some assertionThrows and JUnit was much easier to work with for those imo. But now that those are gone, I agree for these cases - I'll change them back.

import org.junit.jupiter.api.Test;

public class StorageLocationTest {

@Test
public void testOfDifferentPrefixes() {
StorageLocation StandardLocation = StorageLocation.of("file:///path/to/file");
public void testAwsLocations() {
StorageLocation baseStorageLocation = StorageLocation.of("s3://path/to/file");
StorageLocation testStorageLocation = StorageLocation.of("s3://path/to/////file");
Assertions.assertEquals(testStorageLocation, baseStorageLocation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is good, but I believe more test cases are needed to validate the handling of upper / lower case letters and leading / trailing slashes and the special cases of missing scheme.

Copy link
Contributor Author

@adnanhemani adnanhemani May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out - let me add them. Missing scheme is in the function below with this variable: slashLeadingLocation

testStorageLocation = StorageLocation.of("s3://////path/////to///file");
Assertions.assertEquals(testStorageLocation, baseStorageLocation);

testStorageLocation = StorageLocation.of("s3://////path'/////*to===///\"file");
Assertions.assertEquals("s3://path'/*to===/\"file", testStorageLocation.toString());
Copy link
Contributor

@dimas-b dimas-b May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in this URI path' is not a "path" but "authority" per RFC 3986

}

@Test
public void testGcpLocations() {
StorageLocation baseStorageLocation = StorageLocation.of("gcs://path/to/file");
StorageLocation testStorageLocation = StorageLocation.of("gcs://path/to/////file");
Assertions.assertEquals(testStorageLocation, baseStorageLocation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test materially different from testAwsLocations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In spirit, no. Let me merge this case and the AWS cases.

testStorageLocation = StorageLocation.of("gcs://////path/////to///file");
Assertions.assertEquals(testStorageLocation, baseStorageLocation);

testStorageLocation = StorageLocation.of("gcs://////path'/////*to===///\"file");
Assertions.assertEquals("gcs://path'/*to===/\"file", testStorageLocation.toString());
Comment on lines +47 to +48
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above -- is this actually correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work right now in Polaris?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not correct. Actually when looking through Java URI code - we are not currently applying any normalization to URLs...

And I'm not sure that normalizing URLs is entirely necessary either. We've worked through the comments above that double slashes shouldn't be modified and similarly, "." and ".." also shouldn't be normalized for blob storage.

As a result, I think this might be trying to do too much. I'm going to stop work on this and push a new branch that simply just takes the URI create off for the Blob Storage locations and adds additional tests. Will link that PR here when I push it.

}

@Test
public void testLocalFileLocations() {
StorageLocation storageLocation = StorageLocation.of("file:///path/to/file");
StorageLocation slashLeadingLocation = StorageLocation.of("/path/to/file");
StorageLocation fileSingleSlashLocation = StorageLocation.of("file:/path/to/file");
Assertions.assertThat(slashLeadingLocation.equals(StandardLocation)).isTrue();
Assertions.assertThat(fileSingleSlashLocation.equals(StandardLocation)).isTrue();
Assertions.assertEquals(slashLeadingLocation, storageLocation);
Assertions.assertEquals(fileSingleSlashLocation, storageLocation);
}
}