PackfileMaintenanceStep: Use second-largest pack-size for batch size #375
Conversation
/azp run microsoft.scalar
Azure Pipelines successfully started running 1 pipeline(s).
/azp run microsoft.scalar
Azure Pipelines successfully started running 1 pipeline(s).
I guess I lack the context to understand... why is the second-largest pack's size a sensible choice? From what little I understand, I would have expected something like the total size of all packs except the largest one...
I felt like the comment @sluongng contributed on list was enough to justify this work, but now that I revisit the comment I feel like it isn't fully justified. Perhaps @sluongng could contribute some more about their experience here.
Hi folks, so the current batch size calculation is roughly the total size of all pack-files minus the size of the biggest pack.

With the current calculation, I find it can 'miss' quite a lot of the time, resulting in the repack doing nothing. I don't like that, mainly because it's a bit hard to predict when the repack is going to happen, and the storage size of the ODB can spike from time to time (although it's not very significant).

Instead, if you choose the batch_size to be the second-biggest pack size, the repack is much more likely to actually combine the smaller packs on each run. All in all, the second-biggest pack size approach gives me a more consistent result, while the trade-off is that fewer packfiles are repacked on each run. A way to correct for this is just to shorten the run interval of the Repack step so that this job runs more often.

I prefer this approach, as the CPU / RAM / disk usage of the repack step is a lot more consistent. Instead of having a chance of all small packs being repacked, you consistently spend roughly the same CPU and disk I/O every X minutes to repack 2-3 packfiles.

I don't know how to explain it better. I didn't have too much time to dig into the midx format and why the current calculation misses so often.
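The "miss" behavior can be illustrated with a small simulation. This is a hedged sketch in C#, not the actual git or Scalar code, and the ordering of candidate packs is a simplification: the point is only that the repack combines packs whose estimated sizes are below the batch size, and does nothing if those estimates never add up to the batch size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simulation of the batch selection described above.
static class RepackBatchSimulation
{
    // Returns the estimated sizes of the packs that would be combined,
    // or an empty list when the repack would be a no-op (a "miss").
    public static List<long> SelectPacks(IEnumerable<long> estimatedPackSizes, long batchSize)
    {
        var selected = new List<long>();
        long total = 0;

        // Simplification: candidates ordered by size; git walks packs in a
        // different order, but the "miss" behavior is the same.
        foreach (long size in estimatedPackSizes.Where(s => s < batchSize).OrderBy(s => s))
        {
            selected.Add(size);
            total += size;

            if (total >= batchSize)
            {
                return selected; // Enough data to justify a repack.
            }
        }

        // The candidates never added up to the batch size: nothing happens.
        return new List<long>();
    }

    public static void Main()
    {
        // Raw on-disk sizes (MB): 900, 400, 300. The two smaller packs only
        // *count* for an estimated 250 and 180 MB, because most of their
        // objects are duplicated in newer packs.
        long[] estimatedSizes = { 250, 180 };

        long oldBatchSize = 400 + 300;  // "everything except the largest" (raw sizes)
        long newBatchSize = 400;        // the second-largest pack's size

        Console.WriteLine(SelectPacks(estimatedSizes, oldBatchSize).Count); // 0 -> no-op
        Console.WriteLine(SelectPacks(estimatedSizes, newBatchSize).Count); // 2 -> repack
    }
}
```

In this example the old batch size of 700 MB is never reached because the two smaller packs only count for an estimated 430 MB, while the second-largest size of 400 MB is reached easily.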
Thanks for the context, @sluongng! To summarize: the "estimated output size" that we use for predicting the batch size will under-count pack sizes when there are duplicates. This means that the current model will actually no-op most of the time! By picking the second-largest file size, the repack command will have a high likelihood of doing something, and that size will grow geometrically as we go.

And to reiterate on the "estimated output size": this is an important tool for allowing repacks in the GVFS protocol case with the 2 GB size limit. When a user does not prefetch commits and trees for over 30 days, they will get a second "huge" pack of every commit and tree that was reachable 30 days ago. That will contain most of the objects from the older prefetch packs, but due to force-pushes some of those objects may still be interesting. Without this estimated output size, the oldest prefetch pack would never be repacked.

Does that help, @dscho?
Yes, thank you!
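To make the under-counting concrete, here is a minimal sketch of the "estimated output size" idea. The names and arithmetic are illustrative assumptions, not git's actual multi-pack-index code: a pack's contribution is scaled by the fraction of its objects that are still referenced, so a pack whose objects are mostly duplicated in newer packs counts for far less than its on-disk size.

```csharp
using System;

static class EstimatedSizeSketch
{
    // Illustrative only: how much of a pack is expected to survive a repack.
    // Objects that a newer pack already provides are not referenced from
    // this pack, so they do not count toward the estimate.
    public static long EstimatedOutputSize(long packSizeBytes, long objectsInPack, long objectsStillReferenced)
    {
        if (objectsInPack == 0)
        {
            return 0;
        }

        // Scale the on-disk size by the fraction of objects still referenced.
        return packSizeBytes * objectsStillReferenced / objectsInPack;
    }

    public static void Main()
    {
        // A 1 GB prefetch pack whose objects are 95% duplicated by a newer
        // prefetch pack counts for only ~51 MB toward reaching the batch size,
        // even though it occupies a full gigabyte on disk.
        long oneGigabyte = 1L * 1024 * 1024 * 1024;
        Console.WriteLine(EstimatedOutputSize(oneGigabyte, 1_000_000, 50_000));
    }
}
```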
When the total size of all pack-files is larger than 2 gigabytes, then we currently use a strange mechanism for trying to repack everything else into a single pack. It was recommended that we should use the second-largest pack-file size instead. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
I modified the PR description to include the reasoning we worked out in the comments. Thanks!
Thanks!
When the total size of all pack-files is larger than 2 gigabytes, then
we currently use a strange mechanism for trying to repack everything
else into a single pack.
The "estimated output size" that we use for predicting the batch size
will under-count pack sizes when there are duplicates. This means that
the current model will actually no-op most of the time! By picking the
second-largest file size, the repack command will have a high
likelihood of doing something, and that size will grow geometrically
as we go. This strategy was recommended by @sluongng and it seems
to work better.
This only affects "small" repos. We will use 2 gigabytes as the
default batch-size for large repos.
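As a rough sketch of the batch-size choice described above (a hypothetical helper, not the actual PackfileMaintenanceStep implementation): large repos keep the fixed 2 GB batch size, while small repos use the second-largest pack's size.

```csharp
using System.Collections.Generic;
using System.Linq;

static class BatchSizeSketch
{
    private const long TwoGigabytes = 2L * 1024 * 1024 * 1024;

    // packSizesInBytes: on-disk sizes of the packs covered by the multi-pack-index.
    public static long ComputeBatchSize(IReadOnlyList<long> packSizesInBytes)
    {
        // Nothing to combine with fewer than two packs.
        if (packSizesInBytes.Count < 2)
        {
            return 0;
        }

        // Large repos: keep the 2 GB default batch size.
        if (packSizesInBytes.Sum() > TwoGigabytes)
        {
            return TwoGigabytes;
        }

        // Small repos: the second-largest pack size gives the repack a high
        // likelihood of doing something, and the combined pack grows
        // geometrically across maintenance runs.
        return packSizesInBytes.OrderByDescending(size => size).Skip(1).First();
    }
}
```

This only captures the decision spelled out in the description; details such as rounding or how a single-pack repository is handled are left out.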