Skip to content

Conversation

@derrickstolee
Copy link

@derrickstolee derrickstolee commented Dec 6, 2018

The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

  • 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

  • 'git multi-pack-index repack --batch-size=': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:

  1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

  2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

  • Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.

@derrickstolee derrickstolee changed the title [DO NOT MERGE] Generating installer for multi-pack-index expire/repack [WIP] Create 'expire' and 'repack' subcommands for multi-pack-index Dec 11, 2018
@derrickstolee derrickstolee changed the base branch from rebase-git-for-windows-to-v2.20.0-rc0 to vfs-2.20.0 December 11, 2018 15:06
@derrickstolee derrickstolee reopened this Dec 11, 2018
@derrickstolee derrickstolee force-pushed the midx-expire/upstream branch 2 times, most recently from 282fdb3 to ab02108 Compare December 17, 2018 21:58
@derrickstolee derrickstolee changed the title [WIP] Create 'expire' and 'repack' subcommands for multi-pack-index Create 'expire' and 'repack' subcommands for multi-pack-index Dec 17, 2018
@derrickstolee derrickstolee changed the base branch from vfs-2.20.0 to vfs-2.20.1 December 17, 2018 21:58
@derrickstolee derrickstolee reopened this Dec 17, 2018
@derrickstolee derrickstolee force-pushed the midx-expire/upstream branch 2 times, most recently from 40e1285 to e552d50 Compare December 20, 2018 15:14
Copy link
Member

@kewillford kewillford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions for my understanding

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'ignore_keep' parameter is enabled for the use in repack, but
will be important for a future caller.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <sbeller@google.com>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the verb is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When writing a multi-pack-index, we keep track of an integer
permutation, tracking the list of pack-files that we know about
(both from the existing multi-pack-index and the new pack-files
being introduced) and converting them into a sorted order for
the new multi-pack-index.

In anticipation of dropping pack-files from the existing multi-
pack-index, refactor the logic around how we track this permutation.

First, insert the permutation into the pack_list structure. This
allows us to grow the permutation dynamically as we add packs.

Second, fill the permutation with values corresponding to their
position in the list of pack-files, sorted as follows:

  1. The pack-files in the existing multi-pack-index,
     sorted lexicographically.

  2. The pack-files not in the existing multi-pack-index,
     sorted as discovered from the filesystem.

There is a subtle thing in how we initialize this permutation,
specifically how we use 'i' for the initial value. This will
matter more when we implement the logic for dropping existing
packs, as we will create holes in the ordering.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The 'git multi-pack-index expire' command looks at the existing
mult-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-fils with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire command. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The verb will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
To repack using a multi-pack-index, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, adding the packs to a list if they are smaller than the
given pack-size. Finally, collect the objects from the multi-pack-
index that are in those packs and send them to 'git pack-objects'.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the size of the objects
instead of the size of the pack-files. This allows repacking a
large pack-file that has very few referencd objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. This object-size idea could be a direction for
future expansion in this area.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee derrickstolee merged commit 9b9022d into microsoft:vfs-2.20.1 Dec 21, 2018
derrickstolee added a commit to microsoft/VFSForGit that referenced this pull request Dec 21, 2018
Create a new maintenance step for cleaning up the packfiles in the shared object cache.

This is done in two steps:

1. `git multi-pack-index expire`: This deletes the pack-files whose objects appear in newer pack-files. The multi-pack-index prevents `git` from looking at these packs. Rewrites the `multi-pack-index` to no longer refer to these (deleted) packs.

2. `git multi-pack-index repack --batch-size=<size>` inspects packs covered by the multi-pack-index in modified-time order (ascending). Greedily selects a batch of packs whose file sizes are all less than "size", but that sum up to at least "size". Then generate a new pack-file containing the objects that are uniquely referenced by the `multi-pack-index`.

See [this PR in microsoft/git](microsoft/git#84) for details of the `git` implementation.

There are some important details:

* The `expire` command will not delete packs that are marked with a `.keep` file. So, we can keep our most-recent prefetch pack around by writing a `.keep` file. This avoids any change to the logic for the most-recent good prefetch timestamp.

* The `expire` command frequently fails to actually delete `.idx` files in a VFS for Git repo. This is due to something holding a handle to the files, as later we are able to delete them; we've seen this problem before with scripts that try to clean up the pack directory. We clean up `.idx` files that are not paired with a `.pack` after we run the `expire` step. This can generate warnings in the `multi-pack-index write` commands as it tries to add the packs based on their `.idx` files. The commands succeed, however.

* According to the [`git pack-objects` documentation](https://git-scm.com/docs/git-pack-objects#git-pack-objects---windowltngt), we should set the `pack.window` and `pack.depth` config settings to zero to avoid delta calculations. This saves significant computation time when our packs contain blobs, as the deltification algorithm can be very slow. If we were packing only commits and trees, the problem is not as significant. It is important that we don't set this globally, that way we actually compute deltas on push. Instead, just set the settings using the `-c` mechanism in Git.
dscho pushed a commit that referenced this pull request Feb 27, 2019
…ulti-pack-index

The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.
dscho pushed a commit that referenced this pull request Mar 29, 2019
…ulti-pack-index

The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.
dscho pushed a commit that referenced this pull request May 25, 2019
…ulti-pack-index

The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.
dscho pushed a commit that referenced this pull request May 27, 2019
…ulti-pack-index

The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the oldest pack-files covered by the multi-pack-index, find those whose on-disk size is below the batch size until we have a collection of packs whose sizes add up to the batch size. Create a new pack containing all objects that the multi-pack-index references to those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:

1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants