Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster, better feature count estimates for intPKs #467

Merged
merged 1 commit into from
Aug 3, 2021

Conversation

craigds
Copy link
Member

@craigds craigds commented Aug 3, 2021

Description

feature-count estimates for integer PK datasets have been:

  • doing a lot of unnecessary work, especially for sparse trees
  • pretty slow, due to often running git diff-tree multiple times
  • crashing on some large datasets while trying to sample millions(!) of trees

This change rewrites int-PK estimates to use a similar algorithm as we're already using for string-PK datasets.

Estimates for string PKs

String-PK algorithm is to dive from the root down through the tree, until the number of child trees is less than half the branch factor. Then, sample a number of trees at that level, and average their blob count to produce an estimate.

With the string PK algorithm, we only need to dive down one tree path from the root, since we are assured of an even distribution of blobs.

Differences for int PKs

Like string-PK datasets, we can sample a number of trees from the root, and give up once we've sampled enough. We don't need to run a git subprocess.
Unlike string-PK datsets, int-PKs have no particular distribution, and in fact may often have a very uneven distribution. For example, a fairly typical commit might delete or modify 100 evenly-distributed features, and then insert 100 features with tightly-packed PKs.

For int PKs, we instead take a number of samples at the deepest level of trees, acquired by diving through random trees along the way.

This is intended to mitigate the effect of having both distributed updates and tightly-packed inserts in the same commit. It will be slower than the equivalent string-PK estimate, because more non-leaf trees will be traversed along the way to achieve the same sample size.

note

This doesn't fully avoid the problem of updates and inserts in the same commit. For instance a commit is likely to contain 999 leaf trees containing exactly one change, and 1 leaf tree containing 50 changes. So the algorithm sampling 16 trees might:

  • sample 15 trees containing 1 change and 1 tree containing 50 changes, and produce an estimate of (1/16 * 1000 * 50) + (15/16 * 1000 * 1)
    --> 4063 features, with 1.6% probability

  • sample 16 trees containing 1 change, and produce an estimate of (16/16 * 1000 * 1)
    --> 1000 features, with 98.4% probability

Related links:

none

Checklist:

  • Have you reviewed your own change?
  • Have you included test(s)?
  • Have you updated the changelog?

@craigds craigds requested a review from olsen232 August 3, 2021 01:12
@craigds craigds force-pushed the fc-estimates-Ⅱ branch 3 times, most recently from 8eee5b0 to 08b83d4 Compare August 3, 2021 01:59
feature-count estimates for integer PK datasets have been:
* doing a lot of unnecessary work, especially for sparse trees
* pretty slow, due to often running `git diff-tree` multiple times
* crashing on some large datasets while trying to sample millions(!)
  of trees

This change rewrites int-PK estimates to use a similar algorithm as
we're already using for string-PK datasets.

Like string-PK datasets, we can sample a number of trees from the root,
and give up once we've sampled enough. We don't need to run a git
subprocess.
Unlike string-PK datsets, int-PKs have no particular distribution,
and in fact may often have a very uneven distribution. For example, a
fairly typical commit might delete or modify 100 evenly-distributed
features, and then insert 100 features with tightly-packed PKs.

String-PK algorithm is to dive from the root down through the tree,
until the number of child trees is less than half the branch factor.
Then, sample a number of trees at that level, and average their blob
count to produce an estimate.

With the string PK algorithm, we only need to dive down one tree path
from the root, since we are assured of an even distribution of blobs.

For int PKs, we instead take a number of samples at the deepest level
of trees, acquired by diving through *random* trees along the way.

This is intended to mitigate the effect of having both distributed
updates and tightly-packed inserts in the same commit. It will be slower
than the equivalent string-PK estimate, because more non-leaf trees will
be traversed along the way to achieve the same sample size.

note: This doesn't fully avoid the problem of updates and inserts in the
same commit. For instance a commit is likely to contain 999 leaf trees
containing exactly one change, and 1 leaf tree containing 50 changes.
So the algorithm sampling 16 trees might:

* sample 15 trees containing 1 change and 1 tree containing 50 changes,
  and produce an estimate of (1/16 * 1000 * 50) + (15/16 * 1000 * 1)
  --> 4063 features, with 1.6% probability
* sample 16 trees containing 1 change,
  and produce an estimate of (16/16 * 1000 * 1)
  --> 1000 features, with 98.4% probability
@craigds craigds merged commit 37044ae into master Aug 3, 2021
@craigds craigds deleted the fc-estimates-Ⅱ branch August 3, 2021 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants