Faster, better feature count estimates for intPKs #467
Description
Feature-count estimates for integer PK datasets have been:
git diff-tree
multiple times
This change rewrites int-PK estimates to use an algorithm similar to the one we're already using for string-PK datasets.
Estimates for string PKs
The string-PK algorithm dives from the root down through the tree until the number of child trees is less than half the branch factor. It then samples a number of trees at that level and averages their blob counts to produce an estimate.
With the string PK algorithm, we only need to dive down one tree path from the root, since we are assured of an even distribution of blobs.
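As a rough illustration of the description above, here is a minimal sketch in Python. It is not the code from this PR: the tree is modelled as a nested `dict` (a `dict` value is a child tree, anything else is a blob), and the names `estimate_string_pk`, `child_trees`, `count_blobs`, `branch_factor` and `sample_size` are hypothetical. It assumes blobs are evenly distributed, so a single path from the root is representative.

```python
import random


def count_blobs(tree):
    """Count blobs (non-dict values) anywhere under a nested-dict tree."""
    return sum(count_blobs(c) if isinstance(c, dict) else 1 for c in tree.values())


def child_trees(tree):
    """Return the direct children of `tree` that are themselves trees."""
    return [c for c in tree.values() if isinstance(c, dict)]


def estimate_string_pk(root, branch_factor=16, sample_size=4, rng=None):
    """Estimate the blob count by diving down one path, then sampling.

    Assumption (not from the PR): every tree at a given level has about
    the same number of children, so one path stands in for all of them.
    """
    rng = rng or random.Random()
    node, multiplier = root, 1
    while True:
        kids = child_trees(node)
        if not kids:
            # Reached a leaf tree: count it exactly and scale up.
            return multiplier * count_blobs(node)
        if len(kids) < branch_factor / 2:
            break  # sparse level found: sample here
        multiplier *= len(kids)
        node = kids[0]  # any child will do under an even distribution
    sample = rng.sample(kids, min(sample_size, len(kids)))
    average = sum(count_blobs(t) for t in sample) / len(sample)
    return round(multiplier * len(kids) * average)


# Hypothetical tree: 16 full children, each with 4 leaf trees of 3 blobs.
leaf = lambda: {f"blob{i}": None for i in range(3)}
demo_tree = {f"t{i}": {f"l{j}": leaf() for j in range(4)} for i in range(16)}
demo_estimate = estimate_string_pk(demo_tree)
```

Because the demo tree is perfectly uniform, the estimate matches the real count of 16 * 4 * 3 blobs; on real data it is only an approximation.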
Differences for int PKs
As with string-PK datasets, we can sample a number of trees from the root, and give up once we've sampled enough. We don't need to run a git subprocess.
Unlike string-PK datasets, int-PKs have no particular distribution, and in fact may often have a very uneven distribution. For example, a fairly typical commit might delete or modify 100 evenly-distributed features, and then insert 100 features with tightly-packed PKs.
For int PKs, we instead take a number of samples at the deepest level of trees, acquired by diving through random trees along the way.
This is intended to mitigate the effect of having both distributed updates and tightly-packed inserts in the same commit. It will be slower than the equivalent string-PK estimate, because more non-leaf trees will be traversed along the way to achieve the same sample size.
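The random-dive idea can be sketched as follows. This is an illustration under assumptions, not the implementation in this PR: the tree is a nested `dict` (a `dict` value is a child tree, anything else is a blob), `estimate_int_pk` and `sample_size` are hypothetical names, dives may hit the same leaf tree more than once, and scaling each leaf count by the branching seen along its path is my own choice for turning leaf samples into a total.

```python
import random


def estimate_int_pk(root, sample_size=16, rng=None):
    """Estimate the blob count from `sample_size` random dives to leaf trees."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(sample_size):
        node, scale = root, 1
        while True:
            kids = [c for c in node.values() if isinstance(c, dict)]
            if not kids:
                break  # deepest level reached: `node` is a leaf tree
            scale *= len(kids)       # how many sibling paths this dive represents
            node = rng.choice(kids)  # pick a random child and keep diving
        # A leaf tree holds only blobs, so len(node) is its blob count.
        estimates.append(scale * len(node))
    return round(sum(estimates) / len(estimates))


# Hypothetical tree: 10 x 10 leaf trees holding one blob each.
uniform_tree = {
    f"t{i}": {f"l{j}": {"blob": None} for j in range(10)} for i in range(10)
}
uniform_estimate = estimate_int_pk(uniform_tree)
```

Each dive walks every level of the tree, which is where the extra cost relative to the string-PK approach comes from.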
Note
This doesn't fully avoid the problem of updates and inserts in the same commit. For instance, a commit is likely to contain 999 leaf trees containing exactly one change, and 1 leaf tree containing 50 changes (1049 changes in total). So the algorithm sampling 16 trees might:
- sample 15 trees containing 1 change and 1 tree containing 50 changes, and produce an estimate of (1/16 * 1000 * 50) + (15/16 * 1000 * 1) --> 4063 features, with 1.6% probability
- sample 16 trees containing 1 change, and produce an estimate of (16/16 * 1000 * 1) --> 1000 features, with 98.4% probability
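The arithmetic in this note can be checked with a few lines of Python. The variable names are made up for illustration, and the calculation assumes the 16 sampled leaf trees are distinct and chosen uniformly from the 1000.

```python
from fractions import Fraction

leaf_trees = 1000    # leaf trees touched by the commit
heavy_changes = 50   # changes in the single tightly-packed leaf tree
light_changes = 1    # changes in each of the other 999 leaf trees
sample_size = 16

# Actual number of changes in the commit.
true_total = (leaf_trees - 1) * light_changes + heavy_changes

# Chance that the heavy leaf tree is among the 16 distinct samples.
p_hit = Fraction(sample_size, leaf_trees)

# Estimate = (average changes per sampled tree) * (number of leaf trees).
est_hit = Fraction(
    leaf_trees * (heavy_changes + (sample_size - 1) * light_changes), sample_size
)
est_miss = Fraction(leaf_trees * sample_size * light_changes, sample_size)

# Probability-weighted average of the two outcomes.
expected = p_hit * est_hit + (1 - p_hit) * est_miss

print(f"true total:         {true_total}")
print(f"hit:  p={float(p_hit):.1%}, estimate={float(est_hit)}")
print(f"miss: p={float(1 - p_hit):.1%}, estimate={float(est_miss)}")
print(f"expected estimate:  {float(expected)}")
```

Under these assumptions the estimate is right on average, but any single run is either slightly low (1000) or far too high (about 4063), which is the variance problem the note describes.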
Related links:
none