-
Notifications
You must be signed in to change notification settings - Fork 115
change gharchive query to reduce memory blowup, disable Delta #5026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Does it seem like datafusion loads all columns in this case? I wonder if it wants the null counts for all arrays. |
Benchmarks: TPC-H SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: FineWeb NVMeSummary
Detailed Results Table
|
Benchmarks: FineWeb S3Summary
Detailed Results Table
|
Codecov Report✅ All modified and coverable lines are covered by tests. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Benchmarks: TPC-H SF=1 on S3Summary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on NVMESummary
Detailed Results Table
|
Benchmarks: TPC-DS SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on S3Summary
Detailed Results Table
|
Benchmarks: Clickbench on NVMESummary
Detailed Results Table
|
Benchmarks: Statistical and Population GeneticsSummary
Detailed Results Table
|
|
Oh huh, we don't push the filter down at all apparently
I have been under the mistaken impression that we are able to. I see also #5024 |
|
Merging @asubiotto's change helps a lot here. I'm going to review and try and merge that and then revisit this |
09ed4ff to
b426be7
Compare
Benchmarks: GitHub Archive (NVMe)Summary
Detailed Results Table
|
Benchmarks: GitHub Archive (S3)Summary
Detailed Results Table
|
|
Why are we disabling RLE? |
I'm pretty sure at this point the culprit is delta. Applying this diff running in debug, did not trigger the issue for GHArchive. The difference being that the indices are not delta encoded here yet. If the issue would be within RLE, this should also trigger the out of bounds access in my understanding. I did this in encode to see the uncompressed pritimitive array in case the assertion fires. |
|
I can just disable Delta if that is preferable |
|
The new query succeeds by the way without triggering an out of bounds, even with Delta + RLE enabled. Just ran this locally. |
|
That's because the query is no longer decoding the array that needs to be decoded to trigger #4973 , but the data is still being written corrupted |
Yeah makes sense |
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
47e3ec3 to
6f491c6
Compare
gatesn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do it as temporary fix, then can bring the tests back
Signed-off-by: Andrew Duffy <andrew@a10y.dev>


Disable RLE scheme, change
select *toselect count(*). That in combination with #5024 stops us from OOMing every time we run this benchmark.