Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lower bound for groupCubeSize #107

Merged
merged 4 commits into from
Jun 3, 2022

Conversation

Jiaweihu08
Copy link
Member

When the number of elements in a table is large, the groupCubeSize used by CubeWeightsBuilder tends to 1. This can leads to a large number of estimated Cube Weight pairs that can cause OOM errors.

For instance, using default values for bufferCapacity and desiredCubeSize, a table with 2600M records split between 1200 partitions, the groupCubeSize is 190. The algorithm estimates 37K cubes, among which only around 1K are used. The workaround is to set a minimum value for groupCubeSize, for instance 1000, so the number of estimated cubes won't be as large. Using this setting, the number of estimated cubes is reduced to about 5K.

Test added:
Using default bufferCapacity and DesiredCubeSize, make sure that the minimum value for groupCubeSize is respected for different numElements.

@Jiaweihu08 Jiaweihu08 changed the title Min group cube size Add lower bound for groupCubeSize May 27, 2022
@Jiaweihu08 Jiaweihu08 added the type: bug Something isn't working label May 27, 2022
@Jiaweihu08 Jiaweihu08 linked an issue May 27, 2022 that may be closed by this pull request
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 May 27, 2022 15:33
@Jiaweihu08
Copy link
Member Author

The sampling test failed because now the groupCubeSize is 1000 and there are only a handful of cubes. Use a sampling fraction of 0.01 can solve the problem, should be modify the test? @osopardo1

@osopardo1
Copy link
Member

This test in particular only checks if the filtering of the files is done correctly. Yes, is best to do it with a sample of 0.01.

But then we need to test better the number we are using to solve the issue. Or at least, test it in corner cases.

@codecov
Copy link

codecov bot commented May 31, 2022

Codecov Report

Merging #107 (888d0f7) into main (55442d9) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #107   +/-   ##
=======================================
  Coverage   89.34%   89.35%           
=======================================
  Files          60       60           
  Lines        1286     1287    +1     
  Branches      100      106    +6     
=======================================
+ Hits         1149     1150    +1     
  Misses        137      137           
Impacted Files Coverage Δ
.../main/scala/io/qbeast/core/model/CubeWeights.scala 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55442d9...888d0f7. Read the comment docs.

@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review June 2, 2022 17:46
Copy link
Member

@osopardo1 osopardo1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@Jiaweihu08 Jiaweihu08 merged commit 4455fec into Qbeast-io:main Jun 3, 2022
@Jiaweihu08 Jiaweihu08 deleted the min-group-cube-size branch June 3, 2022 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Raise estimatedCubeWeights group size
2 participants