-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] ProvisionedThroughputExceededException when performing VACUUM using S3DynamoDBLogStore #3423
Closed
2 of 8 tasks
Comments
5 tasks
scottsand-db
added a commit
that referenced
this issue
Jul 29, 2024
…VACUUM calls (#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
nastra
pushed a commit
to nastra/delta
that referenced
this issue
Jul 30, 2024
…istFrom VACUUM calls (delta-io#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves delta-io#3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
EstherBear
pushed a commit
to EstherBear/delta
that referenced
this issue
Jul 31, 2024
…istFrom VACUUM calls (delta-io#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves delta-io#3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
scottsand-db
added a commit
to scottsand-db/delta
that referenced
this issue
Aug 1, 2024
…istFrom VACUUM calls (delta-io#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves delta-io#3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
scottsand-db
added a commit
to scottsand-db/delta
that referenced
this issue
Aug 1, 2024
…istFrom VACUUM calls (delta-io#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves delta-io#3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
scottsand-db
added a commit
to scottsand-db/delta
that referenced
this issue
Aug 1, 2024
…istFrom VACUUM calls (delta-io#3425) #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves delta-io#3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
This was referenced Aug 1, 2024
scottsand-db
added a commit
that referenced
this issue
Aug 5, 2024
…ogStore::listFrom VACUUM calls (#3461) Cherry-pick 03bdf84 to branch 3.0 #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
scottsand-db
added a commit
that referenced
this issue
Aug 5, 2024
…ogStore::listFrom VACUUM calls (#3462) Cherry-pick 03bdf84 to branch 3.1 #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
scottsand-db
added a commit
that referenced
this issue
Aug 5, 2024
…ogStore::listFrom VACUUM calls (#3463) Cherry-pick 03bdf84 to branch 3.2 #### Which Delta project/connector is this regarding? - [X] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Resolves #3423. This PR updates the logic in `BaseExternalLogStore::listFrom` so that it does not make a request to get the latest entry from the external store (which is used to perform recovery operations) in the event that a non `_delta_log` file is being listed. This is useful for VACUUM operations which may do hundreds or thousands of list calls in the table directory and nested partition directories of parquet files. This is NOT the `_delta_log`. Thus, checking the external store during these list calls is (1) useless and unwanted as we are not listing the `_delta_log` so clearly now isn't the time to attempt to do a fixup, and (2) expensive. This PR makes it so that future VACUUM operations do not perform unnecessary calls to the external store (e.g. DyanamoDB). ## How was this patch tested? Unit tests and an integration test that actually runs VACUUM and compares the number of external store calls using the old/new logic. I ran that test myself 50 times, too, and it passed every time (therefore, not flaky). ## Does this PR introduce _any_ user-facing changes? No
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Bug
Which Delta project/connector is this regarding?
Describe the problem
Steps to reproduce
Perform VACUUM on a table configured to use the S3DynamoDBLogStore
Observed results
ProvisionedThroughputExceededException caused by:
Expected results
VACUUM shouldn't cause throughput-exceeded exception on the DynamoDB table.
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
The text was updated successfully, but these errors were encountered: