adding logic to detect whether range requests are supported via ContentRange #1717

carlopi
Collaborator

@carlopi carlopi commented Apr 24, 2024

Mostly by @carstonhernke, minor refactor to unify with follow-up logic.

Fixes #1367, and closes #1702 since it implements the same functionality.

@carstonhernke
Contributor

Thanks for looking this over! I just did some testing, and this new logic does not work with my data. Here are the details:

Here is the query I am testing with (pre-signed url is valid for ~11h):

SELECT * FROM 'https://public-duckdb-range-request-test.s3.us-east-1.amazonaws.com/duckdb_test_data.parquet?response-content-disposition=inline&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEEcaCXVzLWVhc3QtMSJHMEUCIE%2F0aTAuIR%2BLhRcj0YO5sMYFwVg8dptLzR0f7Ep2S%2BlKAiEA33lVQFExAz6JvhM30uHST4YJcxSdXi%2Fgs2KzZpWiN7Yq8QIIkP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwzMzk1MjA0NDA4MDIiDJzkOn7RcDiOHuN6mirFAtL7aB9Q3xpGw4BpolagK6tIImUHMnrblV1ACeg8eNbzQL5IYe3zZ4EiUJ6jG0A3ExTtOEE4V6YUKFoqsdHhu9%2FUaLs9Bnmbd9Dle1HgdYscK7PGayfHnr1shxEbAgzRVbt1zNkpWQVtF1W4GVJziPANfO6DLcdAHQhpxfvHwflHMYoXQU4k0rigFors2o13CHD%2B0xDfUfVuGpS%2BkwChJQpSZBv8cWBWQ8zgnbwtd6a8ylkAqFWWW8NqFsn9sgyWjgaJPxMpFuLo5HzanbA41mOoqvS39x6W4O4NfMh9x3NNFs99aVUir8lnNTzzbyxXa0gXkUk2MDaR797yZaxhew8RCe0q6vMRZeDFPglOIYbPimZ6deRd%2BVhrKjYnVXGuOAMYp0H5aFbZnaV2gyFq4n2jJbFVvEJagMJyPhxaZlQ6u6YE1K0w%2Br2ksQY6swJnmbvWBWJrn7fQ6FsZQ05Sy5g7vWBzRvAdfJfLc3REl682ZnGip8plRDFOMtkLXIewEshw6w7o%2FP3KXqCJGAYQej7mSpPdjmhOy%2BxCrtwJmKOGgKY8jYVMkmnxjiIXeiTJaPwZpCBbOZlGIn6wwgNpIQseIKwniznuAE7AHLaNDonR%2BT0cq5MoU3tyVR0mT9NUOCOljmIB39vgbQmYfgJ9nqWVWqoeOC2Ky4A2FoMEndJWJAD1ovDRBpDwYC%2BjLCm9MZAG%2FfEH%2B2luDcDbOnXivlU8Z0vg3IKXaaUFvSq%2FgqU4A%2FGD037QN%2FnIEAkr7sn5wQwoOp5R1XNuX3rRJkC0xhAuk68mqAu9mihlxYNuGbun0bl97CAcmpOgZY%2Fhll0iVLQw5RLMD4DL3rVsoiQq7XQ3&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20240424T150559Z&X-Amz-SignedHeaders=host&X-Amz-Expires=43200&X-Amz-Credential=ASIAU6DH6ZHRDO7M4AOP%2F20240424%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=9ddb6e8a5ea5d9845794630454ccdcf7207357fc283bc121e145449696696384' where postcode = 10787;
  • The first HEAD request returns with a 403
  • contentLength is null as the HEAD request returns a 403, meaning the second block does not get triggered
  • I believe file.dataProtocol is HTTP as I do not specify otherwise (this also prevents the second block from being triggered)

@carlopi
Collaborator Author

carlopi commented Apr 24, 2024

Thanks for the feedback, and for providing an easier way to test this.

I am checking whether removing the following condition fixes it:

-                 if (((contentLength !== null) && (+contentLength > 1)) || file.dataProtocol == DuckDBDataProtocol.S3) {
+                {

I will do some testing on your testcase.
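
To make the reported failure concrete, the condition on the removed line can be modelled as a small predicate. This is only a sketch: `secondBlockRuns` and the `DataProtocol` string type are hypothetical stand-ins (the real code uses `DuckDBDataProtocol`), and only the boolean expression itself is taken from the diff above.

```typescript
// Hypothetical stand-in for DuckDBDataProtocol; only the two values discussed here.
type DataProtocol = "HTTP" | "S3";

// Same boolean expression as the removed line: the follow-up block runs only if
// HEAD reported a content length above 1, or the file uses the S3 protocol.
function secondBlockRuns(contentLength: string | null, protocol: DataProtocol): boolean {
  return (contentLength !== null && +contentLength > 1) || protocol === "S3";
}

// Pre-signed URL case from the report: HEAD returned 403, so there is no
// content length, and the protocol is plain HTTP.
const presignedCase = secondBlockRuns(null, "HTTP");
console.log(presignedCase); // false: the range-detection block is skipped
```

With a `null` content length and the HTTP protocol both halves of the condition are false, which matches the two observations in the report.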

@carstonhernke
Contributor

Yes, I think removing that conditional logic could work, although I really don't know much about the S3 protocol and how it applies here.

I think this issue could be generalized as 'detect support for range requests when HEAD isn't allowed'
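
One way to sketch that generalization is to probe with a one-byte ranged GET instead of HEAD and look at the status code. The names below (`buildRangeProbe`, `interpretProbeStatus`, `probeRangeSupport`, `Fetcher`) are hypothetical and are not the code in this PR; the only facts relied on are standard HTTP behaviour: 206 means the server honoured the `Range` header, while 200 means it ignored it and is sending the full body.

```typescript
// Minimal shapes so the sketch does not depend on DOM typings.
interface ProbeInit {
  method: string;
  headers: Record<string, string>;
}
interface ProbeResponse {
  status: number;
}
type Fetcher = (url: string, init: ProbeInit) => Promise<ProbeResponse>;

type ProbeOutcome = "ranges" | "full" | "unavailable";

// Ask for the first byte only, using GET so that URLs that reject HEAD still work.
function buildRangeProbe(): ProbeInit {
  return { method: "GET", headers: { Range: "bytes=0-0" } };
}

// 206: partial content, range requests are supported.
// 200: the Range header was ignored, the whole file is being returned.
// Anything else: treat the resource as unavailable.
function interpretProbeStatus(status: number): ProbeOutcome {
  if (status === 206) return "ranges";
  if (status === 200) return "full";
  return "unavailable";
}

// The fetcher is injected so the logic can be exercised without a network.
async function probeRangeSupport(url: string, fetcher: Fetcher): Promise<ProbeOutcome> {
  const response = await fetcher(url, buildRangeProbe());
  return interpretProbeStatus(response.status);
}
```

The cost of this approach is the one raised later in the thread: if the server ignores `Range`, the probe itself may start downloading the whole file.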

@carlopi carlopi force-pushed the carstonhernke_fix-issue-#1367-presigned-token-range-requests branch from 138c187 to 21d69b9 Compare April 24, 2024 15:49
@carlopi
Collaborator Author

carlopi commented Apr 24, 2024

@carstonhernke: I changed the code so that it now actually compiles, and I think it should now behave as intended. Could you double check?

The main risk is that we might regress in some cases (for example, the punitive cost of downloading the whole file, which might now happen by mistake). In those cases you will have to turn off allowFullHttpReads.

@carstonhernke
Contributor

Yes, just tested and it behaves as expected! (uses range requests)

@carlopi carlopi merged commit b1b835d into duckdb:main Apr 29, 2024
15 checks passed
@carlopi
Collaborator Author

carlopi commented Apr 30, 2024

@carstonhernke: actually there is a problem: Content-Range is an unsafe header, and as such it will not be issued in Web embeddings of duckdb-wasm.

The logic might still make sense for Node, but at the very least a setting should be added, defaulting to false, that controls whether Content-Range is used.

@carstonhernke
Contributor

I think this depends on the CORS configuration/headers of the remote file being requested. Because Content-Range is not a safelisted response header, it won't be accessible to JS code by default in browsers. However, it can be added to the CORS config:
"ExposeHeaders": [ "Access-Control-Allow-Origin", "Content-Range" ]

If the header is not available, but the server only returns a single byte, then I suppose we need another request to get the whole file. So in a case where this header is not available 90% of the time, I think you're right that it should be an optional setting to avoid adding unnecessary overhead.
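
The fallback described above could look roughly like the following. This is a sketch under assumptions: the `useContentRange` setting name and both function names are hypothetical (the thread only proposes "a setting, defaulting to false"), and a `null` header stands for the case where the browser hides `Content-Range` because CORS does not expose it.

```typescript
// Hypothetical setting; the thread only says it should default to false.
interface RangeProbeSettings {
  useContentRange: boolean;
}

const defaultSettings: RangeProbeSettings = { useContentRange: false };

// Parse a header such as "bytes 0-0/12345" and return the total size.
// Returns null when the header is missing (e.g. not exposed via CORS),
// malformed, or reports an unknown total ("*").
function totalSizeFromContentRange(header: string | null): number | null {
  if (header === null) return null;
  const match = /^bytes\s+(\d+)-(\d+)\/(\d+|\*)$/.exec(header.trim());
  if (match === null || match[3] === "*") return null;
  return Number(match[3]);
}

// Only trust Content-Range when the setting is on and the server answered 206.
// A null result means the caller needs another way to learn the file size,
// for example a further request.
function fileSizeFromProbe(
  status: number,
  contentRange: string | null,
  settings: RangeProbeSettings,
): number | null {
  if (!settings.useContentRange || status !== 206) return null;
  return totalSizeFromContentRange(contentRange);
}
```

Keeping the setting off by default means deployments where the header is usually hidden do not depend on it.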

Successfully merging this pull request may close these issues.

Not getting partial reads of large parquet file in AWS S3 opened from pre-signed URL
2 participants