
PackfileMaintenanceStep: Use second-largest pack-size for batch size #375

Merged · 1 commit · Jun 2, 2020

Conversation

@derrickstolee (Contributor) commented May 4, 2020

When the total size of all pack-files is larger than 2 gigabytes, we
currently use a strange mechanism for trying to repack everything
else into a single pack.

The "estimated output size" that we use for predicting the batch size
will under-count pack sizes when there are duplicates. This means that
the current model will actually no-op most of the time! By picking the
second-largest file size, the repack command will have a high
likelihood of doing something, and that size will grow geometrically
as we go. This strategy was recommended by @sluongng and it seems
to work better.

This only affects "small" repos. We will use 2 gigabytes as the
default batch-size for large repos.
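
Roughly, the new selection looks like this (an illustrative C# sketch with hypothetical names, not the actual PackfileMaintenanceStep code):

```csharp
// Illustrative sketch only; `GetBatchSize` is a hypothetical helper,
// not Scalar's real method. Sizes are in bytes.
using System;
using System.Linq;

static class BatchSizeSketch
{
    private const long TwoGigabytes = 2L * 1024 * 1024 * 1024;

    public static long GetBatchSize(long[] packSizes)
    {
        // Large repos keep the fixed two-gigabyte default batch size.
        if (packSizes.Sum() > TwoGigabytes)
        {
            return TwoGigabytes;
        }

        // With fewer than two packs there is nothing to roll up.
        if (packSizes.Length < 2)
        {
            return 0;
        }

        // Sorted descending, element [1] is the second-largest pack;
        // +1 so that pack itself fits inside the batch.
        return packSizes.OrderByDescending(s => s).ElementAt(1) + 1;
    }

    static void Main()
    {
        long[] sizes = { 400_000_000, 200_000_000, 50_000_000 };
        Console.WriteLine(GetBatchSize(sizes)); // 200000001
    }
}
```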

@derrickstolee (Contributor, Author)

/azp run microsoft.scalar

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@derrickstolee (Contributor, Author)

/azp run microsoft.scalar

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dscho (Member) commented Jun 1, 2020

I guess I lack the context to understand... why is the second-largest pack's size a sensible choice? From what little I understand, I would have expected something like the total size of all packs except the largest one...

@derrickstolee (Contributor, Author) commented Jun 1, 2020

> I guess I lack the context to understand... why is the second-largest pack's size a sensible choice? From what little I understand, I would have expected something like the total size of all packs except the largest one...

I felt like the comment @sluongng contributed on the list was enough to justify this work, but now that I revisit it, I feel it isn't fully justified. Perhaps @sluongng could say more about their experience here.

@sluongng (Contributor) commented Jun 1, 2020

Hi folks,

This batch_size is more of a prediction of what the total repack size should at least be; if the prediction misses, no repack happens.

So the current batch size calculation is:

batch_size = total_pack_size - max_pack_size

batch_size is first compared with expected_size [1] and eventually with total_size [2].
The problem here is that expected_size and total_size vary depending on the nature of the packs in the ODB.

With the current calculation, I find it can 'miss' quite often, resulting in total_size < batch_size, fill_included_packs_batch() returning true, and thus no repack happening. This leads to the number of packs growing (as the prefetch step runs) until total_size gets over batch_size, at which point all the small packs are repacked into one.

I don't like that, mainly because it's a bit hard to predict when the repack is going to happen, and because of the occasional spike in ODB storage size (although it's not very significant).

Instead, if you choose batch_size to be second_biggest_pack_size + 1, a new second-biggest packfile will always be created. This does not repack all the small packfiles like the original approach does, but only the second-biggest packfile and the latest pack (note how QSORT over mtime is used in fill_included_packs_batch() [3]).

All in all, the second-biggest-pack-size approach gives me a more consistent result, while the tradeoff is that fewer packfiles are repacked each run. A way to correct this is to ramp up the frequency of the repack step so that the job runs more often.

I prefer this approach, as the CPU/RAM/disk usage of the repack step is a lot more consistent. Instead of occasionally repacking all the small packs at once, you consistently spend roughly the same CPU and disk I/O every X minutes to repack 2-3 packfiles.

I don't know how to explain it better. I didn't have much time to dig into the midx format and why expected_size is calculated the way it is in git/git@ce1e4a1. But in retrospect, perhaps it could be tweaked somehow to make batch_size less of a prediction 🤔
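
Here is a rough sketch of the difference with made-up numbers (illustrative C#, not the real git or Scalar code):

```csharp
using System;
using System.Linq;

// Made-up pack sizes; the largest (1000) is e.g. the initial clone pack.
long[] packs = { 1000, 120, 100, 30, 20 };

// Old: batch_size = total_pack_size - max_pack_size.
long oldBatch = packs.Sum() - packs.Max(); // 270
// If the deduplicated "expected" sizes of the candidate packs sum to
// less than 270, nothing is repacked and the pack count keeps growing.

// New: batch_size = second_biggest_pack_size + 1.
long newBatch = packs.OrderByDescending(s => s).ElementAt(1) + 1; // 121
// The second-biggest pack (120) plus the newest small packs now fit,
// so each run reliably rolls them up into a new second-biggest pack.

Console.WriteLine($"old: {oldBatch}, new: {newBatch}");
```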

@derrickstolee (Contributor, Author)

Thanks for the context, @sluongng!

To summarize: the "estimated output size" that we use for predicting the batch size will under-count pack sizes when there are duplicates. This means that the current model will actually no-op most of the time! By picking the second-largest file size, the repack command will have a high likelihood of doing something, and that size will grow geometrically as we go.

And to reiterate on the "estimated output size": this is an important tool for allowing repacks in the GVFS protocol case with the 2gb size limit. When a user does not prefetch commits and trees for over 30 days, they will get a second "huge" pack of every commit and tree that was reachable 30 days ago. That will contain most of the objects from the older prefetch packs, but due to force-pushes some of those objects may still be interesting. Without this estimated output size, the oldest prefetch pack would never be repacked.
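
As a toy illustration of that under-count (invented numbers, and only a loose model of how the expected size is estimated):

```csharp
using System;

// Invented numbers; a loose model of how a heavily-duplicated pack's
// estimated output size can land far below its on-disk size.
long packSizeOnDisk = 1_000_000_000;  // 1 GB old prefetch pack
long objectsInPack = 500_000;
long objectsNotDuplicated = 50_000;   // ~90% already live in newer packs

// Scale the on-disk size by the fraction of objects still unique here.
long expectedSize = packSizeOnDisk * objectsNotDuplicated / objectsInPack;
Console.WriteLine(expectedSize); // 100000000 -- a tenth of the disk size
```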

Does that help, @dscho?

@dscho (Member) commented Jun 1, 2020

Yes, thank you!

When the total size of all pack-files is larger than 2 gigabytes, then
we currently use a strange mechanism for trying to repack everything
else into a single pack. It was recommended that we should use the
second-largest pack-file size instead.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee (Contributor, Author)

I modified the PR description to include the reasoning we worked out in the comments. Thanks!

@dscho (Member) left a comment

Thanks!

@derrickstolee merged commit e56f42e into microsoft:master on Jun 2, 2020