Add lower bound for groupCubeSize #107
…ny cubes for large datasets
The sampling test failed because groupCubeSize is now 1000 and there are only a handful of cubes. Using a sampling fraction of 0.01 solves the problem; should we modify the test? @osopardo1
This test in particular only checks whether the files are filtered correctly. Yes, it is best to do it with a sample of 0.01. But then we need to better test the number we are using to solve the issue, or at least test it in corner cases.
Codecov Report
@@ Coverage Diff @@
## main #107 +/- ##
=======================================
Coverage 89.34% 89.35%
=======================================
Files 60 60
Lines 1286 1287 +1
Branches 100 106 +6
=======================================
+ Hits 1149 1150 +1
Misses 137 137
Looks good to me!
When the number of elements in a table is large, the `groupCubeSize` used by `CubeWeightsBuilder` tends to 1. This can lead to a large number of estimated Cube Weight pairs, which can cause OOM errors.

For instance, using default values for `bufferCapacity` and `desiredCubeSize`, a table with 2600M records split between 1200 partitions gets a `groupCubeSize` of 190. The algorithm estimates 37K cubes, of which only around 1K are used. The workaround is to set a minimum value for `groupCubeSize`, for instance 1000, so the number of estimated cubes won't be as large. With this setting, the number of estimated cubes is reduced to about 5K.

Test added:
Using the default `bufferCapacity` and `desiredCubeSize`, make sure that the minimum value for `groupCubeSize` is respected for different `numElements`.
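
To make the lower bound and the added test concrete, here is a minimal sketch in Python. It is not the project's code: the formula, the function name `group_cube_size`, and the default values (`desired_cube_size = 5_000_000`, `buffer_capacity = 100_000`) are assumptions, chosen only because they roughly reproduce the figure of about 190 quoted above for 2600M records in 1200 partitions.

```python
# Hypothetical sketch: names, defaults and formula are assumptions,
# not taken from CubeWeightsBuilder itself.
ASSUMED_DESIRED_CUBE_SIZE = 5_000_000  # assumed default, not stated in this PR
ASSUMED_BUFFER_CAPACITY = 100_000      # assumed default, not stated in this PR
MIN_GROUP_CUBE_SIZE = 1000             # lower bound suggested in the description


def group_cube_size(num_elements, num_partitions,
                    desired_cube_size=ASSUMED_DESIRED_CUBE_SIZE,
                    buffer_capacity=ASSUMED_BUFFER_CAPACITY,
                    min_group_cube_size=MIN_GROUP_CUBE_SIZE):
    """Return the per-group cube size, clamped to a minimum value."""
    # Assumption: records are processed in groups, at least one per
    # partition and at least one per full buffer.
    num_groups = max(num_partitions, num_elements // buffer_capacity, 1)
    # Without a bound, this shrinks towards 1 as num_elements grows.
    unbounded = desired_cube_size // num_groups
    return max(min_group_cube_size, unbounded)


# The example from the description: without the bound the value is about 190.
print(group_cube_size(2_600_000_000, 1200, min_group_cube_size=1))  # 192
print(group_cube_size(2_600_000_000, 1200))                         # 1000

# Shape of the added test: the bound holds for different numElements.
for n in (1_000, 1_000_000, 100_000_000, 2_600_000_000, 100_000_000_000):
    assert group_cube_size(n, 1200) >= MIN_GROUP_CUBE_SIZE
```

Under these assumptions the bound only takes effect for large tables; for small ones the unbounded value is already well above 1000, so existing behaviour is unchanged there.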