Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10998: [C++] Detect URIs where a filesystem path is expected #11997

Closed
wants to merge 3 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Dec 20, 2021

An occasional misunderstanding is to pass a URI to filesystem methods, where a regular path is expected.

Make these situations easier to diagnose by raising a specific error.

An occasional misunderstanding is to pass a URI to filesystem methods, where a regular path is expected.

Make these situations easier to diagnose by raising a specific error.
@github-actions
Copy link

@pitrou
Copy link
Member Author

pitrou commented Dec 20, 2021

@coryan Would you like to take a look at the GCS changes here?

Copy link
Contributor

@coryan coryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a minor nit in the tests.

@@ -797,6 +838,11 @@ TEST_F(GcsIntegrationTest, OpenInputStreamInfoInvalid) {
ASSERT_RAISES(IOError, fs->OpenInputStream(info));
}

TEST_F(GcsIntegrationTest, OpenInputStreamUri) {
auto fs = internal::MakeGcsFileSystemForTest(TestGcsOptions());
ASSERT_RAISES(Invalid, fs->OpenInputStream("gcs:" + PreexistingObjectPath()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just for testing purposes, but the conventional prefix for GCS is gs://, at least that is what gsutil uses:

https://cloud.google.com/storage/docs/gsutil#syntax

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thank you.
By the way, it seems other packages have adopted the gcs:// convention: https://gcsfs.readthedocs.io/en/latest/#integration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh... My guess (and let me stress that this would be just a guess) is that gs:// is more familiar to GCS users. I can live with gcs:// too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable change to me. Maybe drop a note that colons are not supported in filenames here (https://github.com/apache/arrow/blob/master/docs/source/cpp/io.rst#filesystems) where we mention . and .. are not supported?

@@ -402,6 +402,10 @@ struct S3Path {
std::vector<std::string> key_parts;

static Result<S3Path> FromString(const std::string& s) {
if (internal::IsLikelyUri(s)) {
return Status::Invalid(
"Expected a S3 object path of the form 'bucket/key...', got a URI: '", s, "'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Expected a S3 object path of the form 'bucket/key...', got a URI: '", s, "'");
"Expected an S3 object path of the form 'bucket/key...', got a URI: '", s, "'");

@pitrou
Copy link
Member Author

pitrou commented Dec 20, 2021

Maybe drop a note that colons are not supported in filenames here (https://github.com/apache/arrow/blob/master/docs/source/cpp/io.rst#filesystems) where we mention . and .. are not supported?

Hmm, they should be more or less supported except in the first segment of a path.

@pitrou pitrou closed this in 238b363 Dec 20, 2021
@pitrou pitrou deleted the ARROW-10998-uri-path-mismatch branch December 20, 2021 19:55
@ursabot
Copy link

ursabot commented Dec 20, 2021

Benchmark runs are scheduled for baseline = cfcce5a and contender = 238b363. 238b363 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️2.24% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.24% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

rafael-telles pushed a commit to rafael-telles/arrow that referenced this pull request Dec 20, 2021
An occasional misunderstanding is to pass a URI to filesystem methods, where a regular path is expected.

Make these situations easier to diagnose by raising a specific error.

Closes apache#11997 from pitrou/ARROW-10998-uri-path-mismatch

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@jorisvandenbossche
Copy link
Member

Cool, this is a nice usability improvement!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants