Faster, better feature count estimates for intPKs #467

craigds · 2021-08-03T01:12:41Z

Description

feature-count estimates for integer PK datasets have been:

doing a lot of unnecessary work, especially for sparse trees
pretty slow, due to often running git diff-tree multiple times
crashing on some large datasets while trying to sample millions(!) of trees

This change rewrites int-PK estimates to use a similar algorithm as we're already using for string-PK datasets.

Estimates for string PKs

String-PK algorithm is to dive from the root down through the tree, until the number of child trees is less than half the branch factor. Then, sample a number of trees at that level, and average their blob count to produce an estimate.

With the string PK algorithm, we only need to dive down one tree path from the root, since we are assured of an even distribution of blobs.

Differences for int PKs

Like string-PK datasets, we can sample a number of trees from the root, and give up once we've sampled enough. We don't need to run a git subprocess.
Unlike string-PK datsets, int-PKs have no particular distribution, and in fact may often have a very uneven distribution. For example, a fairly typical commit might delete or modify 100 evenly-distributed features, and then insert 100 features with tightly-packed PKs.

For int PKs, we instead take a number of samples at the deepest level of trees, acquired by diving through random trees along the way.

This is intended to mitigate the effect of having both distributed updates and tightly-packed inserts in the same commit. It will be slower than the equivalent string-PK estimate, because more non-leaf trees will be traversed along the way to achieve the same sample size.

note

This doesn't fully avoid the problem of updates and inserts in the same commit. For instance a commit is likely to contain 999 leaf trees containing exactly one change, and 1 leaf tree containing 50 changes. So the algorithm sampling 16 trees might:

sample 15 trees containing 1 change and 1 tree containing 50 changes, and produce an estimate of (1/16 * 1000 * 50) + (15/16 * 1000 * 1)
--> 4063 features, with 1.6% probability
sample 16 trees containing 1 change, and produce an estimate of (16/16 * 1000 * 1)
--> 1000 features, with 98.4% probability

Checklist:

Have you reviewed your own change?
Have you included test(s)?
Have you updated the changelog?

feature-count estimates for integer PK datasets have been: * doing a lot of unnecessary work, especially for sparse trees * pretty slow, due to often running `git diff-tree` multiple times * crashing on some large datasets while trying to sample millions(!) of trees This change rewrites int-PK estimates to use a similar algorithm as we're already using for string-PK datasets. Like string-PK datasets, we can sample a number of trees from the root, and give up once we've sampled enough. We don't need to run a git subprocess. Unlike string-PK datsets, int-PKs have no particular distribution, and in fact may often have a very uneven distribution. For example, a fairly typical commit might delete or modify 100 evenly-distributed features, and then insert 100 features with tightly-packed PKs. String-PK algorithm is to dive from the root down through the tree, until the number of child trees is less than half the branch factor. Then, sample a number of trees at that level, and average their blob count to produce an estimate. With the string PK algorithm, we only need to dive down one tree path from the root, since we are assured of an even distribution of blobs. For int PKs, we instead take a number of samples at the deepest level of trees, acquired by diving through *random* trees along the way. This is intended to mitigate the effect of having both distributed updates and tightly-packed inserts in the same commit. It will be slower than the equivalent string-PK estimate, because more non-leaf trees will be traversed along the way to achieve the same sample size. note: This doesn't fully avoid the problem of updates and inserts in the same commit. For instance a commit is likely to contain 999 leaf trees containing exactly one change, and 1 leaf tree containing 50 changes. So the algorithm sampling 16 trees might: * sample 15 trees containing 1 change and 1 tree containing 50 changes, and produce an estimate of (1/16 * 1000 * 50) + (15/16 * 1000 * 1) --> 4063 features, with 1.6% probability * sample 16 trees containing 1 change, and produce an estimate of (16/16 * 1000 * 1) --> 1000 features, with 98.4% probability

craigds requested a review from olsen232 August 3, 2021 01:12

craigds force-pushed the fc-estimates-Ⅱ branch 3 times, most recently from 8eee5b0 to 08b83d4 Compare August 3, 2021 01:59

olsen232 approved these changes Aug 3, 2021

View reviewed changes

craigds force-pushed the fc-estimates-Ⅱ branch from 08b83d4 to 78d93f1 Compare August 3, 2021 02:42

craigds merged commit 37044ae into master Aug 3, 2021

craigds deleted the fc-estimates-Ⅱ branch August 3, 2021 02:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster, better feature count estimates for intPKs #467

Faster, better feature count estimates for intPKs #467

craigds commented Aug 3, 2021 •

edited

Loading

Faster, better feature count estimates for intPKs #467

Faster, better feature count estimates for intPKs #467

Conversation

craigds commented Aug 3, 2021 • edited Loading

Description

Estimates for string PKs

Differences for int PKs

note

Related links:

Checklist:

craigds commented Aug 3, 2021 •

edited

Loading