core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects #1134

Status: Closed · wants to merge 13 commits
8 changes: 8 additions & 0 deletions Documentation/config/core.txt
@@ -596,6 +596,14 @@ core.fsyncMethod::
* `writeout-only` issues pagecache writeback requests, but depending on the

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
> [...]
> +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> +		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> +		if (!bulk_fsync_objdir)
> +			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));

Should camel-case the config var, and we should have a die_errno() here
which tells us why we couldn't create it (possibly needing to ferry it up
from the tempfile API...)


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 7:43 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> > [...]
> > +     if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> > +             bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> > +             if (!bulk_fsync_objdir)
> > +                     die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
>
> Should camel-case the config var, and we should have a die_errno() here
> which tells us why we couldn't create it (possibly needing to ferry it up
> from the tempfile API...)

Thanks for noticing the camelCasing.  The config var name was also
wrong. Now it will read:
> > +                     die(_("Could not create temporary object directory for core.fsyncMethod=batch"));

Do you have any recommendations on how to easily ferry the correct
errno out of tmp_objdir_create? It looks like the remerge-diff usage
has the same die behavior without errno, and the builtin/receive-pack.c
usage doesn't die, but also loses the errno. I'm concerned about
preserving the errno across the tmp_objdir_destroy calls. I could
introduce a temp errno var to preserve it across those. Is that what
you had in mind?

Thanks,
Neeraj
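
The saved-errno pattern Neeraj describes can be sketched as follows. This is a hedged illustration, not Git's actual code: `try_create`, `fail_with`, and `cleanup_clobbers` are hypothetical stand-ins for a failing `tmp_objdir_create` and the `tmp_objdir_destroy` cleanup that may clobber errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: preserve errno from the failing call across
 * cleanup (e.g. tmp_objdir_destroy) so a die_errno()-style report
 * names the original cause, not whatever cleanup last set. */
static int fail_with(int err)      { errno = err; return -1; }
static void cleanup_clobbers(void) { errno = 0; } /* stands in for tmp_objdir_destroy() */

static int try_create(char *msgbuf, size_t len)
{
	if (fail_with(EACCES) < 0) {
		int saved_errno = errno;   /* save before any cleanup runs */
		cleanup_clobbers();        /* may overwrite errno */
		errno = saved_errno;       /* restore for error reporting */
		snprintf(msgbuf, len,
			 "cannot create temporary object directory: %s",
			 strerror(errno));
		return -1;
	}
	return 0;
}
```

The same temp-variable dance would apply around any cleanup call sequenced between the failure and the die.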


On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> One major source of the cost of fsync is the implied flush of the
> hardware writeback cache within the disk drive. This commit introduces
> a new `core.fsyncMethod=batch` option that batches up hardware flushes.
> It hooks into the bulk-checkin plugging and unplugging functionality,
> takes advantage of tmp-objdir, and uses the writeout-only support code.
>
> When the new mode is enabled, we do the following for each new object:
> 1. Create the object in a tmp-objdir.
> 2. Issue a pagecache writeback request and wait for it to complete.
>
> At the end of the entire transaction when unplugging bulk checkin:
> 1. Issue an fsync against a dummy file to flush the hardware writeback
>    cache, which should by now have seen the tmp-objdir writes.
> 2. Rename all of the tmp-objdir files to their final names.
> 3. When updating the index and/or refs, we assume that Git will issue
>    another fsync internal to that operation. This is not the default
>    today, but the user now has the option of syncing the index and there
>    is a separate patch series to implement syncing of refs.

Re my question in
https://lore.kernel.org/git/220310.86r179ki38.gmgdl@evledraar.gmail.com/
(which you *partially* replied to per my reading, i.e. not the
fsync_nth() question) I still don't get why the tmp-objdir part of this
is needed.

For "git stash" which is one thing sped up by this let's go over what
commands/FS ops we do. I changed the test like this:
	
	diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
	index 3fc16944e9e..479a495c68c 100755
	--- a/t/t3903-stash.sh
	+++ b/t/t3903-stash.sh
	@@ -1383,7 +1383,7 @@ BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
	 
	 test_expect_success 'stash with core.fsyncmethod=batch' "
	 	test_create_unique_files 2 4 fsync-files &&
	-	git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
	+	strace -f git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
	 	rm -f fsynced_files &&
	 
	 	# The files were untracked, so use the third parent,
	
Then we get this output, with my comments, and I snipped some output:
	 
	$ ./t3903-stash.sh --run=1-4,114 -vixd 2>&1|grep --color -e 89772c935031c228ed67890f9 -e .git/stash -e bulk_fsync -e .git/index
	[pid 14703] access(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
	[pid 14703] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
	[pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/tmp_obj_bdUlzu", ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0

Here we're creating the tmp_objdir() files. We then sync_file_range()
and close() this.

	[pid 14703] openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7", O_RDWR|O_CREAT|O_EXCL, 0600) = 9
	[pid 14703] unlink("/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7") = 0

This is the flushing of the "cookie" in do_batch_fsync().

	[pid 14703] newfstatat(AT_FDCWD, ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, 0) = 0
	[pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0

Here we're going through the object dir migration with
unplug_bulk_checkin().

	[pid 14703] unlink(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
	newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
	[pid 14705] access(".git/objects/tmp_objdir-bulk-fsync-0F7DGy/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
	[pid 14705] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = 0
	[pid 14705] utimensat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", NULL, 0) = 0
	[pid 14707] openat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", O_RDONLY|O_CLOEXEC) = 9

We then update the index itself, first a temporary index.stash :

    openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 8
    openat(AT_FDCWD, ".git/index.stash.19141", O_RDONLY) = 9
    newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
    newfstatat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", {st_mode=S_IFREG|0644, st_size=927, ...}, 0) = 0
    rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141") = 0
    unlink(".git/index.stash.19141")        = 0

Followed by the same and a later rename of the actual index:

    [pid 19146] rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index") = 0

So, my question is still why the temporary object dir migration part of
this is needed.

We are writing N loose object files, and we write those to temporary
names already.

AFAICT we could do all of this by doing the same
tmp/rename/sync_file_range dance on the main object store.

Then instead of the "bulk_fsync" cookie file don't close() the last file
object file we write until we issue the fsync on it.

But maybe this is all needed, I just can't understand from the commit
message why the "bulk checkin" part is being done.

I think since we've been over this a few times without any success it
would really help to have some example of the smallest set of syscalls
to write a file like this safely. I.e. this is doing (pseudocode):

    /* first the bulk path */
    open("bulk/x.tmp");
    write("bulk/x.tmp");
    sync_file_range("bulk/x.tmp");
    close("bulk/x.tmp");
    rename("bulk/x.tmp", "bulk/x");
    open("bulk/y.tmp");
    write("bulk/y.tmp");
    sync_file_range("bulk/y.tmp");
    close("bulk/y.tmp");
    rename("bulk/y.tmp", "bulk/y");
    /* Rename to "real" */
    rename("bulk/x", x");
    rename("bulk/y", y");
    /* sync a cookie */
    fsync("cookie");

And I'm asking why it's not:

    /* Rename to "real" as we go */
    open("x.tmp");
    write("x.tmp");
    sync_file_range("x.tmp");
    close("x.tmp");
    rename("x.tmp", "x");
    last_fd = open("y.tmp"); /* don't close() the last one yet */
    write("y.tmp");
    sync_file_range("y.tmp");
    rename("y.tmp", "y");
    /* sync a cookie */
    fsync(last_fd);

Which I guess is two questions:

 A. do we need the cookie, or can we re-use the fd of the last thing we
    write?
 B. Is the bulk indirection needed?

> +		fsync_or_die(fd, "loose object file");

Unrelated nit: this API is producing sentence lego unfriendly to
translators.

Should be made to take an enum or something, so we can emit the relevant
translated message in fsync_or_die(). Imagine getting:

	fsync error on '日本語は話せません'

Which this will do, just the other way around for non-English speakers
using the translation.

(The solution is also not to add _() here, since translators will want
to control the word order.)
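
The enum-based approach suggested above can be sketched like this. The names (`fsync_what`, `fsync_error_message`) and the message table are hypothetical illustrations, not Git's API; the point is that each full sentence lives in one string a translator can reorder, instead of gluing "fsync error on" to a separately produced noun phrase.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: one enum value per thing we fsync, and one
 * complete sentence per value. In Git each format string would be
 * wrapped in _() so translators control the word order. */
enum fsync_what {
	FSYNC_LOOSE_OBJECT_FILE,
	FSYNC_PACK_FILE,
};

static const char *fsync_error_message(enum fsync_what what, const char *path)
{
	static char buf[256]; /* sketch only; not thread-safe */
	switch (what) {
	case FSYNC_LOOSE_OBJECT_FILE:
		/* would be _("fsync error on loose object file '%s'") */
		snprintf(buf, sizeof(buf),
			 "fsync error on loose object file '%s'", path);
		break;
	case FSYNC_PACK_FILE:
		snprintf(buf, sizeof(buf),
			 "fsync error on pack file '%s'", path);
		break;
	}
	return buf;
}
```

A die function taking the enum plus the path could then emit the whole translated sentence in one piece.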

> diff --git a/cache.h b/cache.h
> index 3160bc1e489..d1ae51388c9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1040,7 +1040,8 @@ extern int use_fsync;
>  
>  enum fsync_method {
>  	FSYNC_METHOD_FSYNC,
> -	FSYNC_METHOD_WRITEOUT_ONLY
> +	FSYNC_METHOD_WRITEOUT_ONLY,
> +	FSYNC_METHOD_BATCH
>  };
>  
>  extern enum fsync_method fsync_method;
> @@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *);
>  int fsync_component(enum fsync_component component, int fd);
>  void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
>  
> +static inline int batch_fsync_enabled(enum fsync_component component)
> +{
> +	return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
> +}
> +
>  ssize_t read_in_full(int fd, void *buf, size_t count);
>  ssize_t write_in_full(int fd, const void *buf, size_t count);
>  ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
> diff --git a/config.c b/config.c
> index 261ee7436e0..0b28f90de8b 100644
> --- a/config.c
> +++ b/config.c
> @@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>  			fsync_method = FSYNC_METHOD_FSYNC;
>  		else if (!strcmp(value, "writeout-only"))
>  			fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> +		else if (!strcmp(value, "batch"))
> +			fsync_method = FSYNC_METHOD_BATCH;
>  		else
>  			warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
>  
> diff --git a/object-file.c b/object-file.c
> index 5258d9ed827..bdb0a38328f 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1895,6 +1895,8 @@ static void close_loose_object(int fd)
>  
>  	if (fsync_object_files > 0)
>  		fsync_or_die(fd, "loose object file");
> +	else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> +		fsync_loose_object_bulk_checkin(fd);
>  	else
>  		fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
>  				       "loose object file");

This is related to the above comments about what minimum set of syscalls
are needed to trigger this "bulk" behavior, but it seems to me that this
whole API is avoiding just passing some new flags down to object-file.c
and friends.

For e.g. update-index that results in e.g. the "plug bulk" not being
aware of HASH_WRITE_OBJECT, so with dry-run writes and the like we'll do
the whole setup/teardown for nothing.

Which is another reason I wondered why this couldn't be a flag passed
down to the object writing...


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > One major source of the cost of fsync is the implied flush of the
> > hardware writeback cache within the disk drive. This commit introduces
> > a new `core.fsyncMethod=batch` option that batches up hardware flushes.
> > It hooks into the bulk-checkin plugging and unplugging functionality,
> > takes advantage of tmp-objdir, and uses the writeout-only support code.
> >
> > When the new mode is enabled, we do the following for each new object:
> > 1. Create the object in a tmp-objdir.
> > 2. Issue a pagecache writeback request and wait for it to complete.
> >
> > At the end of the entire transaction when unplugging bulk checkin:
> > 1. Issue an fsync against a dummy file to flush the hardware writeback
> >    cache, which should by now have seen the tmp-objdir writes.
> > 2. Rename all of the tmp-objdir files to their final names.
> > 3. When updating the index and/or refs, we assume that Git will issue
> >    another fsync internal to that operation. This is not the default
> >    today, but the user now has the option of syncing the index and there
> >    is a separate patch series to implement syncing of refs.
>
> Re my question in
> https://lore.kernel.org/git/220310.86r179ki38.gmgdl@evledraar.gmail.com/
> (which you *partially* replied to per my reading, i.e. not the
> fsync_nth() question) I still don't get why the tmp-objdir part of this
> is needed.
>

Sorry for not fully answering your question. I think part of the issue
might be background: it's not clear to me where your understanding and
mine differ, so I may not have included something that's questionable
to you but not to me.

Your syscall description below makes the issues very concrete, so I
think we'll get it this round :).

> For "git stash" which is one thing sped up by this let's go over what
> commands/FS ops we do. I changed the test like this:
>
>         diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
>         index 3fc16944e9e..479a495c68c 100755
>         --- a/t/t3903-stash.sh
>         +++ b/t/t3903-stash.sh
>         @@ -1383,7 +1383,7 @@ BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
>
>          test_expect_success 'stash with core.fsyncmethod=batch' "
>                 test_create_unique_files 2 4 fsync-files &&
>         -       git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
>         +       strace -f git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
>                 rm -f fsynced_files &&
>
>                 # The files were untracked, so use the third parent,
>
> Then we get this output, with my comments, and I snipped some output:
>
>         $ ./t3903-stash.sh --run=1-4,114 -vixd 2>&1|grep --color -e 89772c935031c228ed67890f9 -e .git/stash -e bulk_fsync -e .git/index
>         [pid 14703] access(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
>         [pid 14703] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
>         [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/tmp_obj_bdUlzu", ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
>
> Here we're creating the tmp_objdir() files. We then sync_file_range()
> and close() this.
>
>         [pid 14703] openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7", O_RDWR|O_CREAT|O_EXCL, 0600) = 9
>         [pid 14703] unlink("/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7") = 0
>
> This is the flushing of the "cookie" in do_batch_fsync().
>
>         [pid 14703] newfstatat(AT_FDCWD, ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, 0) = 0
>         [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
>
> Here we're going through the object dir migration with
> unplug_bulk_checkin().
>
>         [pid 14703] unlink(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
>         newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
>         [pid 14705] access(".git/objects/tmp_objdir-bulk-fsync-0F7DGy/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
>         [pid 14705] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = 0
>         [pid 14705] utimensat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", NULL, 0) = 0
>         [pid 14707] openat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", O_RDONLY|O_CLOEXEC) = 9
>
> We then update the index itself, first a temporary index.stash :
>
>     openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 8
>     openat(AT_FDCWD, ".git/index.stash.19141", O_RDONLY) = 9
>     newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
>     newfstatat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", {st_mode=S_IFREG|0644, st_size=927, ...}, 0) = 0
>     rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141") = 0
>     unlink(".git/index.stash.19141")        = 0
>
> Followed by the same and a later rename of the actual index:
>
>     [pid 19146] rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index") = 0
>
> So, my question is still why the temporary object dir migration part of
> this is needed.
>
> We are writing N loose object files, and we write those to temporary
> names already.
>
> AFAIKT we could do all of this by doing the same
> tmp/rename/sync_file_range dance on the main object store.
>

Why not the main object store? We want to maintain the invariant that
any name in the main object store refers to a file that durably has the
correct contents. If we do sync_file_range and then rename, and then
crash, we now have a file in the main object store with some SHA name,
whose contents may or may not match the SHA. However, if we ensure an
fsync happens before the rename, a crash at any point will leave us
either with no file in the main object store or with a file that is
durable on the disk.

> Then instead of the "bulk_fsync" cookie file don't close() the last file
> object file we write until we issue the fsync on it.
>
> But maybe this is all needed, I just can't understand from the commit
> message why the "bulk checkin" part is being done.
>
> I think since we've been over this a few times without any success it
> would really help to have some example of the smallest set of syscalls
> to write a file like this safely. I.e. this is doing (pseudocode):
>
>     /* first the bulk path */
>     open("bulk/x.tmp");
>     write("bulk/x.tmp");
>     sync_file_range("bulk/x.tmp");
>     close("bulk/x.tmp");
>     rename("bulk/x.tmp", "bulk/x");
>     open("bulk/y.tmp");
>     write("bulk/y.tmp");
>     sync_file_range("bulk/y.tmp");
>     close("bulk/y.tmp");
>     rename("bulk/y.tmp", "bulk/y");
>     /* Rename to "real" */
>     rename("bulk/x", x");
>     rename("bulk/y", y");
>     /* sync a cookie */
>     fsync("cookie");
>

The '/* Rename to "real" */' and '/* sync a cookie */' steps are
reversed in your above sequence. It should be:

1: (for each file)
    a) open
    b) write
    c) sync_file_range
    d) close
    e) rename in tmp_objdir -- The rename step is not required for
       bulk-fsync. An earlier version of this series didn't do it, but
       Jeff King pointed out that it was required for concurrency:
       https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/

2: fsync something on the same volume to flush the filesystem log and
   disk cache. This functions as a "barrier".

3: Rename to final names. At this point we know that the "contents"
   are durable, so if the final name exists, we can read through it to
   get the data.
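
The ordering above (per-file writeback, one hardware flush as a barrier, then renames to final names) can be sketched in plain POSIX/Linux C. This is a hedged illustration, not Git's implementation: `write_one`, `flush_barrier`, and the path names are hypothetical, and the portable `fsync` fallback stands in for `sync_file_range` on non-Linux systems.

```c
/* Sketch of the batched-fsync ordering: per-file pagecache writeback,
 * one real fsync as a barrier, then renames to final names.
 * Hypothetical helper names; not Git's actual code. */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_one(const char *tmp, const char *staged,
		      const void *buf, size_t len)
{
	int fd = open(tmp, O_RDWR | O_CREAT | O_TRUNC, 0600);
	assert(fd >= 0);
	assert(write(fd, buf, len) == (ssize_t)len);
#if defined(__linux__)
	/* Step 1c: request writeback and wait for it, without flushing
	 * the disk cache yet (cheap compared to a full fsync). */
	assert(sync_file_range(fd, 0, 0,
			       SYNC_FILE_RANGE_WRITE |
			       SYNC_FILE_RANGE_WAIT_AFTER) == 0);
#else
	fsync(fd); /* portable fallback for this sketch */
#endif
	assert(close(fd) == 0);
	/* Step 1e: rename within the staging (tmp-objdir) area. */
	assert(rename(tmp, staged) == 0);
}

static void flush_barrier(const char *cookie)
{
	/* Step 2: one real fsync on the same volume acts as a barrier
	 * for everything written out above. */
	int fd = open(cookie, O_RDWR | O_CREAT | O_TRUNC, 0600);
	assert(fd >= 0);
	assert(fsync(fd) == 0);
	assert(close(fd) == 0);
	assert(unlink(cookie) == 0);
}
```

Step 3 is then a plain `rename()` of each staged name to its final name; only after `flush_barrier` returns is it safe to publish those names.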

> And I'm asking why it's not:
>
>     /* Rename to "real" as we go */
>     open("x.tmp");
>     write("x.tmp");
>     sync_file_range("x.tmp");
>     close("x.tmp");
>     rename("x.tmp", "x");
>     last_fd = open("y.tmp"); /* don't close() the last one yet */
>     write("y.tmp");
>     sync_file_range("y.tmp");
>     rename("y.tmp", "y");
>     /* sync a cookie */
>     fsync(last_fd);
>
> Which I guess is two questions:
>
>  A. do we need the cookie, or can we re-use the fd of the last thing we
>     write?

We can re-use the FD of the last thing we write, but that results in a
trickier API which is more intrusive on callers. I was originally using
a lockfile, but found a usage where there was no lockfile in
unpack-objects.

>  B. Is the bulk indirection needed?
>

Hopefully the explanation above makes it clear why we need the
indirection. To state it again: we need a real fsync before creating
the final name in the objdir; otherwise, after a crash, a name could
exist that points at contents which could have been lost, since they
weren't durable. I updated the comment in do_batch_fsync to make this a
little clearer.

> > +             fsync_or_die(fd, "loose object file");
>
> Unrelated nit: this API is producing sentence lego unfriendly to
> translators.
>
> Should be made to take an enum or something, so we can emit the relevant
> translated message in fsync_or_die(). Imagine getting:
>
>         fsync error on '日本語は話せません'
>
> Which this will do, just the other way around for non-English speakers
> using the translation.
>
> (The solution is also not to add _() here, since translators will want
> to control the word order.)

This line is copied from the preexisting version of the same code in
close_loose_object. If I'm understanding it correctly, the entire chain
of messages is untranslated and would remain as English; fsync_or_die
doesn't have a _(). Can we just leave it that way, since this is not a
situation that should actually happen to many users? Alternatively, I
think it would be pretty trivial to just pass through the file name, so
I'll just do that.

> > diff --git a/cache.h b/cache.h
> > index 3160bc1e489..d1ae51388c9 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1040,7 +1040,8 @@ extern int use_fsync;
> >
> >  enum fsync_method {
> >       FSYNC_METHOD_FSYNC,
> > -     FSYNC_METHOD_WRITEOUT_ONLY
> > +     FSYNC_METHOD_WRITEOUT_ONLY,
> > +     FSYNC_METHOD_BATCH
> >  };
> >
> >  extern enum fsync_method fsync_method;
> > @@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *);
> >  int fsync_component(enum fsync_component component, int fd);
> >  void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
> >
> > +static inline int batch_fsync_enabled(enum fsync_component component)
> > +{
> > +     return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
> > +}
> > +
> >  ssize_t read_in_full(int fd, void *buf, size_t count);
> >  ssize_t write_in_full(int fd, const void *buf, size_t count);
> >  ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
> > diff --git a/config.c b/config.c
> > index 261ee7436e0..0b28f90de8b 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
> >                       fsync_method = FSYNC_METHOD_FSYNC;
> >               else if (!strcmp(value, "writeout-only"))
> >                       fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> > +             else if (!strcmp(value, "batch"))
> > +                     fsync_method = FSYNC_METHOD_BATCH;
> >               else
> >                       warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
> >
> > diff --git a/object-file.c b/object-file.c
> > index 5258d9ed827..bdb0a38328f 100644
> > --- a/object-file.c
> > +++ b/object-file.c
> > @@ -1895,6 +1895,8 @@ static void close_loose_object(int fd)
> >
> >       if (fsync_object_files > 0)
> >               fsync_or_die(fd, "loose object file");
> > +     else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> > +             fsync_loose_object_bulk_checkin(fd);
> >       else
> >               fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
> >                                      "loose object file");
>
> This is related to the above comments about what minimum set of syscalls
> are needed to trigger this "bulk" behavior, but it seems to me that this
> whole API is avoiding just passing some new flags down to object-file.c
> and friends.
>
> For e.g. update-index that results in e.g. the "plug bulk" not being
> aware of HASH_WRITE_OBJECT, so with dry-run writes and the like we'll do
> the whole setup/teardown for nothing.
>
> Which is another reason I wondered why this couldn't be a flag passed
> down to the object writing...

In the original implementation [1], I did some custom thing for
renaming the files rather than tmp_objdir. But you suggested at the
time that I use tmp_objdir, which was a good decision, since it made
access to the objects possible in-process and for descendants in the
middle of the transaction.

It sounds to me like I just shouldn't plug the bulk checkin for cases
where we're not going to add to the ODB. Plugging the bulk checkin is
always optional. But when I wrote the code, I didn't love the result,
since it made things harder for arbitrary callers. So I changed the
code to lazily create the tmp objdir the first time an object shows up,
which has the same effect of avoiding the cost when we aren't adding
any objects. This also avoids the need to write an error message, since
failing to create the tmp objdir will just result in a fsync. The main
downside here is that it's another thing that will have to change if we
want to make adding to the ODB multithreaded.

Thanks,
Neeraj

[1] https://lore.kernel.org/all/12cad737635663ed596e52f89f0f4f22f58bfe38.1632176111.git.gitgitgadget@gmail.com/


On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 21 2022, Neeraj Singh wrote:

[Don't have time for a full reply, sorry, just something quick]

> On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> [...]
>> So, my question is still why the temporary object dir migration part of
>> this is needed.
>>
>> We are writing N loose object files, and we write those to temporary
>> names already.
>>
>> AFAIKT we could do all of this by doing the same
>> tmp/rename/sync_file_range dance on the main object store.
>>
>
> Why not the main object store? We want to maintain the invariant that any
> name in the main object store refers to a file that durably has the
> correct contents.
> If we do sync_file_range and then rename, and then crash, we now have a file
> in the main object store with some SHA name, whose contents may or may not
> match the SHA.  However, if we ensure an fsync happens before the rename,
> a crash at any point will leave us either with no file in the main
> object store or
> with a file that is durable on the disk.

Ah, I see.

Why does that matter? If the "bulk" mode works as advertised we might
have such a corrupt loose or pack file, but we won't have anything
referring to it as far as reachability goes.

I'm aware that the various code paths that handle OID writing don't deal
too well with it in practice to say the least, which one can try with
say:

    $ echo foo | git hash-object -w --stdin
    45b983be36b73c0788dc9cbcb76cbb80fc7bb057
    $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057

I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
that hash-object again even returns successfully (we see it's there
already, and think it's OK).

But in any case, I think it would me much easier to both review and
reason about this code if these concerns were split up.

I.e. things that want no fsync at all (I'd think especially so) might
want to have such updates serialized in this manner, and as Junio
pointed out making these things inseparable as you've done creates API
concerns & fallout that's got nothing to do with what we need for the
performance gains of the bulk checkin fsyncing technique,
e.g. concurrent "update-index" consumers not being able to assume
reported objects exist as soon as they're reported.

>> Then instead of the "bulk_fsync" cookie file don't close() the last file
>> object file we write until we issue the fsync on it.
>>
>> But maybe this is all needed, I just can't understand from the commit
>> message why the "bulk checkin" part is being done.
>>
>> I think since we've been over this a few times without any success it
>> would really help to have some example of the smallest set of syscalls
>> to write a file like this safely. I.e. this is doing (pseudocode):
>>
>>     /* first the bulk path */
>>     open("bulk/x.tmp");
>>     write("bulk/x.tmp");
>>     sync_file_range("bulk/x.tmp");
>>     close("bulk/x.tmp");
>>     rename("bulk/x.tmp", "bulk/x");
>>     open("bulk/y.tmp");
>>     write("bulk/y.tmp");
>>     sync_file_range("bulk/y.tmp");
>>     close("bulk/y.tmp");
>>     rename("bulk/y.tmp", "bulk/y");
>>     /* Rename to "real" */
>>     rename("bulk/x", x");
>>     rename("bulk/y", y");
>>     /* sync a cookie */
>>     fsync("cookie");
>>
>
> The '/* Rename to "real" */' and '/* sync a cookie */' steps are
> reversed in your above sequence. It should be

Sorry.

> 1: (for each file)
>     a) open
>     b) write
>     c) sync_file_range
>     d) close
>     e) rename in tmp_objdir  -- The rename step is not required for
> bulk-fsync. An earlier version of this series didn't do it, but
> Jeff King pointed out that it was required for concurrency:
> https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/

Yes we definitely need the rename, I was wondering about why we needed
it 2x for each file, but that was answered above.

>> And I'm asking why it's not:
>>
>>     /* Rename to "real" as we go */
>>     open("x.tmp");
>>     write("x.tmp");
>>     sync_file_range("x.tmp");
>>     close("x.tmp");
>>     rename("x.tmp", "x");
>>     last_fd = open("y.tmp"); /* don't close() the last one yet */
>>     write("y.tmp");
>>     sync_file_range("y.tmp");
>>     rename("y.tmp", "y");
>>     /* sync a cookie */
>>     fsync(last_fd);
>>
>> Which I guess is two questions:
>>
>>  A. do we need the cookie, or can we re-use the fd of the last thing we
>>     write?
>
> We can re-use the FD of the last thing we write, but that results in a
> trickier API which is more intrusive on callers. I was originally using
> a lockfile, but found a usage where there was no lockfile in
> unpack-objects.

Ok, so it's something we could do, but passing down 2-3 functions to
object-file.c was a hassle.

I tried to hack that up earlier and found that it wasn't *too
bad*. I.e. we'd pass some "flags" about our intent, and amend various
functions to take "don't close this one" and pass up the fd (or even do
that as a global).

In any case, having the commit message clearly document what's needed
for what & what's essential & just shortcut taken for the convenience of
the current implementation would be really useful.

Then we can always e.g. change this later to just do the fsync() on
the last of N we write.

[Ran out of time, sorry]

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 21 2022, Neeraj Singh wrote:
>
> [Don't have time for a full reply, sorry, just something quick]
>
> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> > [...]
> >> So, my question is still why the temporary object dir migration part of
> >> this is needed.
> >>
> >> We are writing N loose object files, and we write those to temporary
> >> names already.
> >>
> >> AFAIKT we could do all of this by doing the same
> >> tmp/rename/sync_file_range dance on the main object store.
> >>
> >
> > Why not the main object store? We want to maintain the invariant that any
> > name in the main object store refers to a file that durably has the
> > correct contents.
> > If we do sync_file_range and then rename, and then crash, we now have a file
> > in the main object store with some SHA name, whose contents may or may not
> > match the SHA.  However, if we ensure an fsync happens before the rename,
> > a crash at any point will leave us either with no file in the main
> > object store or
> > with a file that is durable on the disk.
>
> Ah, I see.
>
> Why does that matter? If the "bulk" mode works as advertised we might
> have such a corrupt loose or pack file, but we won't have anything
> referring to it as far as reachability goes.
>
> I'm aware that the various code paths that handle OID writing don't deal
> too well with it in practice to say the least, which one can try with
> say:
>
>     $ echo foo | git hash-object -w --stdin
>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
>
I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
that hash-object again even returns success (we see it's there
> already, and think it's OK).
>

I was under the impression that in practice a corrupt loose object can
create persistent problems in the repo for future commands, since we
might not aggressively verify that an existing file with a certain OID
really is valid when adding a new instance of the data with the same
OID.

If you don't have an fsync barrier before producing the final
content-addressable name, you can't reason about "this operation
happened before that operation," so it wouldn't really be valid to say
that "we won't have anything referring to it as far as reachability
goes."

It's entirely possible that you'd have trees pointing to other trees or
blobs that aren't valid, since data writes can become durable in any
order. At this point, future attempts to add the same blobs or trees
might silently drop the updates.  I'm betting that's why
core.fsyncObjectFiles was added in the first place, since someone
observed severe persistent consequences for this form of corruption.

> But in any case, I think it would be much easier to both review and
> reason about this code if these concerns were split up.
>
> I.e. things that want no fsync at all (I'd think especially so) might
> want to have such updates serialized in this manner, and as Junio
> pointed out making these things inseparable as you've done creates API
> concerns & fallout that's got nothing to do with what we need for the
> performance gains of the bulk checkin fsyncing technique,
> e.g. concurrent "update-index" consumers not being able to assume
> reported objects exist as soon as they're reported.
>

I want to explicitly not respond to this concern. I don't believe this
100-line patch can be usefully split.

> >> Then instead of the "bulk_fsync" cookie file don't close() the last file
> >> object file we write until we issue the fsync on it.
> >>
> >> But maybe this is all needed, I just can't understand from the commit
> >> message why the "bulk checkin" part is being done.
> >>
> >> I think since we've been over this a few times without any success it
> >> would really help to have some example of the smallest set of syscalls
> >> to write a file like this safely. I.e. this is doing (pseudocode):
> >>
> >>     /* first the bulk path */
> >>     open("bulk/x.tmp");
> >>     write("bulk/x.tmp");
> >>     sync_file_range("bulk/x.tmp");
> >>     close("bulk/x.tmp");
> >>     rename("bulk/x.tmp", "bulk/x");
> >>     open("bulk/y.tmp");
> >>     write("bulk/y.tmp");
> >>     sync_file_range("bulk/y.tmp");
> >>     close("bulk/y.tmp");
> >>     rename("bulk/y.tmp", "bulk/y");
> >>     /* Rename to "real" */
> >>     rename("bulk/x", "x");
> >>     rename("bulk/y", "y");
> >>     /* sync a cookie */
> >>     fsync("cookie");
> >>
> >
> > The '/* Rename to "real" */' and '/* sync a cookie */' steps are
> > reversed in your above sequence. It should be
>
> Sorry.
>
> > 1: (for each file)
> >     a) open
> >     b) write
> >     c) sync_file_range
> >     d) close
> >     e) rename in tmp_objdir  -- The rename step is not required for
> > bulk-fsync. An earlier version of this series didn't do it, but
> > Jeff King pointed out that it was required for concurrency:
> > https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/
>
> Yes we definitely need the rename, I was wondering about why we needed
> it 2x for each file, but that was answered above.
>
> >> And I'm asking why it's not:
> >>
> >>     /* Rename to "real" as we go */
> >>     open("x.tmp");
> >>     write("x.tmp");
> >>     sync_file_range("x.tmp");
> >>     close("x.tmp");
> >>     rename("x.tmp", "x");
> >>     last_fd = open("y.tmp"); /* don't close() the last one yet */
> >>     write("y.tmp");
> >>     sync_file_range("y.tmp");
> >>     rename("y.tmp", "y");
> >>     /* sync a cookie */
> >>     fsync(last_fd);
> >>
> >> Which I guess is two questions:
> >>
> >>  A. do we need the cookie, or can we re-use the fd of the last thing we
> >>     write?
> >
> > We can re-use the FD of the last thing we write, but that results in a
> > trickier API which
> > is more intrusive on callers. I was originally using a lockfile, but
> > found a usage where
> > there was no lockfile in unpack-objects.
>
> Ok, so it's something we could do, but passing down 2-3 functions to
> object-file.c was a hassle.
>
> I tried to hack that up earlier and found that it wasn't *too
> bad*. I.e. we'd pass some "flags" about our intent, and amend various
> functions to take "don't close this one" and pass up the fd (or even do
> that as a global).
>
> In any case, having the commit message clearly document what's needed
> for what & what's essential & just shortcut taken for the convenience of
> the current implementation would be really useful.
>
> Then we can always e.g. change this later to just do the fsync() on
> the last of N we write.
>

I left a comment in the (now very long) commit message that indicates the
dummy file is there to make the API simpler.

Thanks,
Neeraj

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 21 2022, Neeraj Singh wrote:

> On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Mon, Mar 21 2022, Neeraj Singh wrote:
>>
>> [Don't have time for a full reply, sorry, just something quick]
>>
>> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
>> > [...]
>> >> So, my question is still why the temporary object dir migration part of
>> >> this is needed.
>> >>
>> >> We are writing N loose object files, and we write those to temporary
>> >> names already.
>> >>
>> >> AFAIKT we could do all of this by doing the same
>> >> tmp/rename/sync_file_range dance on the main object store.
>> >>
>> >
>> > Why not the main object store? We want to maintain the invariant that any
>> > name in the main object store refers to a file that durably has the
>> > correct contents.
>> > If we do sync_file_range and then rename, and then crash, we now have a file
>> > in the main object store with some SHA name, whose contents may or may not
>> > match the SHA.  However, if we ensure an fsync happens before the rename,
>> > a crash at any point will leave us either with no file in the main
>> > object store or
>> > with a file that is durable on the disk.
>>
>> Ah, I see.
>>
>> Why does that matter? If the "bulk" mode works as advertised we might
>> have such a corrupt loose or pack file, but we won't have anything
>> referring to it as far as reachability goes.
>>
>> I'm aware that the various code paths that handle OID writing don't deal
>> too well with it in practice to say the least, which one can try with
>> say:
>>
>>     $ echo foo | git hash-object -w --stdin
>>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
>>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
>>
>> I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
>> that hash-object again even returns success (we see it's there
>> already, and think it's OK).
>>
>
> I was under the impression that in-practice a corrupt loose-object can create
> persistent problems in the repo for future commands, since we might not
> aggressively verify that an existing file with a certain OID really is
> valid when
> adding a new instance of the data with the same OID.

Yes, it can. As the hash-object case shows we don't even check at all.

For "incoming push" we *will* notice, but will just uselessly error
out.

I actually had some patches a while ago to turn off our own home-grown
SHA-1 collision checking.

It had the nice side effect of making it easier to recover from loose
object corruption, since you could (re-)push the corrupted OID as a
PACK, we wouldn't check (and die) on the bad loose object, and since we
take a PACK over LOOSE we'd recover:
https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/

> If you don't have an fsync barrier before producing the final
> content-addressable
> name, you can't reason about "this operation happened before that operation,"
> so it wouldn't really be valid to say that "we won't have anything
> referring to it as far
> as reachability goes."

That's correct, but we're discussing a feature that *does have* that
fsync barrier. So if we get an error while writing the loose objects
before the "cookie" fsync we'll presumably error out. That'll then be
followed by an fsync() of whatever makes the objects reachable.

> It's entirely possible that you'd have trees pointing to other trees
> or blobs that aren't
> valid, since data writes can be durable in any order. At this point,
> future attempts to add
> the same blobs or trees might silently drop the updates.  I'm betting that's why
> core.fsyncObjectFiles was added in the first place, since someone
> observed severe
> persistent consequences for this form of corruption.

Well, you can see Linus's original rant-as-documentation for why we
added it :) I.e. the original git implementation made some heavy
linux-FS assumption about the order of writes and an fsync() flushing
any previous writes, which wasn't portable.

>> But in any case, I think it would be much easier to both review and
>> reason about this code if these concerns were split up.
>>
>> I.e. things that want no fsync at all (I'd think especially so) might
>> want to have such updates serialized in this manner, and as Junio
>> pointed out making these things inseparable as you've done creates API
>> concerns & fallout that's got nothing to do with what we need for the
>> performance gains of the bulk checkin fsyncing technique,
>> e.g. concurrent "update-index" consumers not being able to assume
>> reported objects exist as soon as they're reported.
>>
>
> I want to explicitly not respond to this concern. I don't believe this
> 100 line patch
> can be usefully split.

Leaving "usefully" aside for a second (since that's subjective), it
clearly "can". I just tried this on top of "seen":

	diff --git a/bulk-checkin.c b/bulk-checkin.c
	index a702e0ff203..9e994c4d6ae 100644
	--- a/bulk-checkin.c
	+++ b/bulk-checkin.c
	@@ -9,15 +9,12 @@
	 #include "pack.h"
	 #include "strbuf.h"
	 #include "string-list.h"
	-#include "tmp-objdir.h"
	 #include "packfile.h"
	 #include "object-store.h"
	 
	 static int bulk_checkin_plugged;
	 static int needs_batch_fsync;
	 
	-static struct tmp_objdir *bulk_fsync_objdir;
	-
	 static struct bulk_checkin_state {
	 	char *pack_tmp_name;
	 	struct hashfile *f;
	@@ -110,11 +107,6 @@ static void do_batch_fsync(void)
	 		strbuf_release(&temp_path);
	 		needs_batch_fsync = 0;
	 	}
	-
	-	if (bulk_fsync_objdir) {
	-		tmp_objdir_migrate(bulk_fsync_objdir);
	-		bulk_fsync_objdir = NULL;
	-	}
	 }
	 
	 static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
	@@ -321,7 +313,6 @@ void fsync_loose_object_bulk_checkin(int fd)
	 	 */
	 	if (bulk_checkin_plugged &&
	 	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
	-		assert(bulk_fsync_objdir);
	 		if (!needs_batch_fsync)
	 			needs_batch_fsync = 1;
	 	} else {
	@@ -343,19 +334,6 @@ int index_bulk_checkin(struct object_id *oid,
	 void plug_bulk_checkin(void)
	 {
	 	assert(!bulk_checkin_plugged);
	-
	-	/*
	-	 * A temporary object directory is used to hold the files
	-	 * while they are not fsynced.
	-	 */
	-	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
	-		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
	-		if (!bulk_fsync_objdir)
	-			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
	-
	-		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
	-	}
	-
	 	bulk_checkin_plugged = 1;
	 }

And then tried running:

    $ GIT_PERF_MAKE_OPTS='CFLAGS=-O3' ./run HEAD~ HEAD -- p3900-stash.sh	

And got:
    
    Test                                              HEAD~              HEAD
    --------------------------------------------------------------------------------------------
    3900.2: stash 500 files (object_fsyncing=false)   0.56(0.08+0.09)    0.60(0.08+0.08) +7.1%
    3900.4: stash 500 files (object_fsyncing=true)    14.50(0.07+0.15)   17.13(0.10+0.12) +18.1%
    3900.6: stash 500 files (object_fsyncing=batch)   1.14(0.08+0.11)    1.03(0.08+0.10) -9.6%

Now, I really don't trust that perf run to say anything except these
being in the same ballpark, but it's clearly going to be a *bit* faster
since we'll be doing fewer IOps.

As to "usefully" I really do get what you're saying that you only find
these useful when you combine the two because you'd like to have 100%
safety, and that's fair enough.

But since we are going to have a knob to turn off fsyncing entirely, and
we have this "bulk" mode which requires you to carefully reason about
your FS semantics to ascertain safety the performance/safety trade-off
is clearly something that's useful to have tweaks for.

And with "bulk" the concern about leaving behind stray corrupt objects
is entirely orthogonal to concerns about losing a ref update, which is
the main thing we're worried about.

I also don't see how even if you're arguing that nobody would want one
without the other because everyone who cares about "bulk" cares about
this stray-corrupt-loose-but-no-ref-update case, how it has any business
being tied up in the "bulk" mode as far as the implementation goes.

That's because the same edge case is exposed by
core.fsyncObjectFiles=false for those who are assuming the initial
"ordered" semantics.

I.e. if we're saying that before we write the ref we'd like to not
expose the WIP objects in the primary object store because they're not
fsync'd yet, how is that mode different than "bulk" if we crash while
doing that operation (before the eventual fsync()).

So I really think it's much better to split these concerns up.

I think even if you insist on the same end-state it makes the patch
progression much *easier* to reason about. We'd then solve one problem
at a time, and start with a commit where just the semantics that are
unique to "bulk" are implemented, with nothing else conflated with
those.

> [...]
>> Ok, so it's something we could do, but passing down 2-3 functions to
>> object-file.c was a hassle.
>>
>> I tried to hack that up earlier and found that it wasn't *too
>> bad*. I.e. we'd pass some "flags" about our intent, and amend various
>> functions to take "don't close this one" and pass up the fd (or even do
>> that as a global).
>>
>> In any case, having the commit message clearly document what's needed
>> for what & what's essential & just shortcut taken for the convenience of
>> the current implementation would be really useful.
>>
>> Then we can always e.g. change this later to just do the fsync() on
>> the last of N we write.
>>
>
> I left a comment in the (now very long) commit message that indicates the
> dummy file is there to make the API simpler.

In terms of more understandable progression I also think this series
would be much easier to understand if it converted one caller without
needing the "cookie" where doing so is easy, e.g. the unpack-objects.c
caller where we're processing nr_objects, so we can just pass down a
flag to do the fsync() for i == nr_objects.

That'll then clearly show that the whole business of having the global
state on the side is just a replacement for passing down such a flag.

On the Git mailing list, Neeraj Singh wrote (reply to this):

I'm going to respond in more detail to your individual patches,
(expect the last mail to contain a comment at the end "LAST MAIL").

On Wed, Mar 23, 2022 at 3:52 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Tue, Mar 22 2022, Neeraj Singh wrote:
>
> > On Tue, Mar 22, 2022 at 8:48 PM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> As with unpack-objects in a preceding commit have update-index.c make
> >> use of the HASH_N_OBJECTS{,_{FIRST,LAST}} flags. We now have a "batch"
> >> mode again for "update-index".
> >>
> >> Adding the t/* directory from git.git on a Linux ramdisk is a bit
> >> faster than with the tmp-objdir indirection:
> >>
> >>         git hyperfine -L rev ns/batched-fsync,HEAD -s 'make CFLAGS=-O3' -p 'rm -rf repo && git init repo && cp -R t repo/' 'git ls-files -- t | ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' --warmup 1 -r 10
> >>         Benchmark 1: git ls-files -- t | ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' in 'ns/batched-fsync
> >>           Time (mean ± σ):     289.8 ms ±   4.0 ms    [User: 186.3 ms, System: 103.2 ms]
> >>           Range (min … max):   285.6 ms … 297.0 ms    10 runs
> >>
> >>         Benchmark 2: git ls-files -- t | ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' in 'HEAD
> >>           Time (mean ± σ):     273.9 ms ±   7.3 ms    [User: 189.3 ms, System: 84.1 ms]
> >>           Range (min … max):   267.8 ms … 291.3 ms    10 runs
> >>
> >>         Summary
> >>           'git ls-files -- t | ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' in 'HEAD' ran
> >>             1.06 ± 0.03 times faster than 'git ls-files -- t | ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' in 'ns/batched-fsync'
> >>
> >> And as before running that with "strace --summary-only" slows things
> >> down a bit (probably mimicking slower I/O a bit). I then get:
> >>
> >>         Summary
> >>           'git ls-files -- t | strace --summary-only ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' in 'HEAD' ran
> >>             1.21 ± 0.02 times faster than 'git ls-files -- t | strace --summary-only ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin' in 'ns/batched-fsync'
> >>
> >> We also go from ~51k syscalls to ~39k, with ~2x the number of link()
> >> and unlink() in ns/batched-fsync.
> >>
> >> In the process of doing this conversion we lost the "bulk" mode for
> >> files added on the command-line. I don't think it's useful to optimize
> >> that, but we could if anyone cared.
> >>
> >> We've also converted this to a string_list, we could walk with
> >> getline_fn() and get one line "ahead" to see what we have left, but I
> >> found that state machine a bit painful, and at least in my testing
> >> buffering this doesn't harm things. But we could also change this to
> >> stream again, at the cost of some getline_fn() twiddling.
> >>
> >> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> ---
> >>  builtin/update-index.c | 31 +++++++++++++++++++++++++++----
> >>  1 file changed, 27 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/builtin/update-index.c b/builtin/update-index.c
> >> index af02ff39756..c7cbfe1123b 100644
> >> --- a/builtin/update-index.c
> >> +++ b/builtin/update-index.c
> >> @@ -1194,15 +1194,38 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >>         }
> >>
> >>         if (read_from_stdin) {
> >> +               struct string_list list = STRING_LIST_INIT_NODUP;
> >>                 struct strbuf line = STRBUF_INIT;
> >>                 struct strbuf unquoted = STRBUF_INIT;
> >> +               size_t i, nr;
> >> +               unsigned oflags;
> >>
> >>                 setup_work_tree();
> >> -               while (getline_fn(&line, stdin) != EOF)
> >> -                       line_from_stdin(&line, &unquoted, prefix, prefix_length,
> >> -                                       nul_term_line, set_executable_bit, 0);
> >> +               while (getline_fn(&line, stdin) != EOF) {
> >> +                       size_t len = line.len;
> >> +                       char *str = strbuf_detach(&line, NULL);
> >> +
> >> +                       string_list_append_nodup(&list, str)->util = (void *)len;
> >> +               }
> >> +
> >> +               nr = list.nr;
> >> +               oflags = nr > 1 ? HASH_N_OBJECTS : 0;
> >> +               for (i = 0; i < nr; i++) {
> >> +                       size_t nth = i + 1;
> >> +                       unsigned f = i == 0 ? HASH_N_OBJECTS_FIRST :
> >> +                                 nr == nth ? HASH_N_OBJECTS_LAST : 0;
> >> +                       struct strbuf buf = STRBUF_INIT;
> >> +                       struct string_list_item *item = list.items + i;
> >> +                       const size_t len = (size_t)item->util;
> >> +
> >> +                       strbuf_attach(&buf, item->string, len, len);
> >> +                       line_from_stdin(&buf, &unquoted, prefix, prefix_length,
> >> +                                       nul_term_line, set_executable_bit,
> >> +                                       oflags | f);
> >> +                       strbuf_release(&buf);
> >> +               }
> >>                 strbuf_release(&unquoted);
> >> -               strbuf_release(&line);
> >> +               string_list_clear(&list, 0);
> >>         }
> >>
> >>         if (split_index > 0) {
> >> --
> >> 2.35.1.1428.g1c1a0152d61
> >>
> >
> > This buffering introduces the same potential risk my version had: the
> > "stdin-feeder" process may not be able to see objects right away. I'm
> > planning to mitigate the issue by unplugging the bulk checkin when
> > issuing a verbose report, so that anyone who's using that output to
> > synchronize can still see what they're expecting.
>
> I was rather terse in the commit message, I meant (but forgot some
> words) "doesn't harm thing for performance [in the above test]", but
> converting this to a string_list is clearly a regression that shouldn't
> be kept.
>
> I just wanted to demonstrate method of doing this by passing down the
> HASH_* flags, and found that writing the state-machine to "buffer ahead"
> by one line so that we can eventually know in the loop if we're in the
> "last" line or not was tedious, so I came up with this POC. But we
> clearly shouldn't lose the "streaming" aspect.
>

From my experience working on several state machines in the Windows
OS, they are notoriously difficult to understand and extend.  I
wouldn't want every top-level command that does something interesting
to have to deal with that.

> But anyway, now that I look at this again the smart thing here (surely?)
> is to keep the simple getline() loop and not ever issue a
> HASH_N_OBJECTS_LAST for the Nth item, instead we should in this case do
> the "checkpoint fsync" at the point that we write the actual index.
>
> Because an existing redundancy in your series is that you'll do the
> fsync() the same way for "git unpack-objects" as for "git
> {update-index,add}".
>
> I.e. in the former case adding the N objects is all we're doing, so the
> "last object" is the point at which we need to flush the previous N to
> disk.
>
> But for "update-index/add" you'll do at least 2 fsync()'s in the bulk
> mode, when it should be one. I.e. the equivalent of (leaving aside the
> tmp-objdir migration part of it), if writing objects A && B:
>
>     ## METHOD ONE
>     # A
>     write(objects/A.tmp)
>     bulk_fsync(objects/A.tmp)
>     rename(objects/A.tmp, objects/A)
>     # B
>     write(objects/B.tmp)
>     bulk_fsync(objects/B.tmp)
>     rename(objects/B.tmp, objects/B)
>     # "cookie"
>     write(bulk_fsync_XXXXXX)
>     fsync(bulk_fsync_XXXXXX)
>     # ref
>     write(INDEX.tmp, $(git rev-parse B))
>     fsync(INDEX.tmp)
>     rename(INDEX.tmp, INDEX)
>
> This series on top changes that so we know that we're doing N, so we
> don't need the seperate "cookie", we can just use the B object as the
> cookie, as we know it comes last;
>
>     ## METHOD TWO
>     # A -- SAME as above
>     write(objects/A.tmp)
>     bulk_fsync(objects/A.tmp)
>     rename(objects/A.tmp, objects/A)
>     # B -- SAME as above, with s/bulk_fsync/fsync/
>     write(objects/B.tmp)
>     fsync(objects/B.tmp)
>     rename(objects/B.tmp, objects/B)
>     # "cookie" -- GONE!
>     # ref -- SAME
>     write(INDEX.tmp, $(git rev-parse B))
>     fsync(INDEX.tmp)
>     rename(INDEX.tmp, INDEX)
>
> But really, we should instead realize that we're not doing
> "unpack-objects", but have a "ref update" at the end (whether that's a
> ref, or an index etc.) and do:
>
>     ## METHOD THREE
>     # A -- SAME as above
>     write(objects/A.tmp)
>     bulk_fsync(objects/A.tmp)
>     rename(objects/A.tmp, objects/A)
>     # B -- SAME as the first
>     write(objects/B.tmp)
>     bulk_fsync(objects/B.tmp)
>     rename(objects/B.tmp, objects/B)
>     # ref -- SAME
>     write(INDEX.tmp, $(git rev-parse B))
>     fsync(INDEX.tmp)
>     rename(INDEX.tmp, INDEX)
>
> Which cuts our number of fsync() operations down from 2 to 1, in
> addition to removing the need for the "cookie", which is only there
> because we didn't keep track of where we were in the sequence as in my
> 2/7 and 5/7.
>

I agree that this is a great direction to go in as an extension to
this work (i.e. a subsequent patch).  I saw in one of your mails on v2
of your rfc series that you mentioned a "lightweight transaction-y
thing".  I've been thinking along the same lines myself, but wanted to
treat that as a separable concern.  In my ideal world, we'd just use a
real database for loose objects, the index, and refs and let that
handle the transaction management.  But in lieu of that, having a
transaction that looks across the ODB, index, and refs would let us
batch syncs optimally.

> And it would be the same for tmp-objdir, the rename dance is a bit
> different, but we'd do the "full" fsync() while on the INDEX.tmp, then
> migrate() the tmp-objdir, and once that's done do the final:
>
>     rename(INDEX.tmp, INDEX)
>
> I.e. we'd fsync() the content once, and only have the rename() or link()
> operations left. For POSIX we'd need a few more fsync() for the
> metadata, but this (i.e. your) series already makes the hard assumption
> that we don't need to do that for rename().
>
> > I think the code you've presented here is a lot of diff to accomplish
> > the same thing that my series does, where this specific update-index
> > caller has been roto-tilled to provide the needed
> > begin/end-transaction points.
>
> Any caller of these APIs will need the "unsigned oflags" sooner than
> later anyway, as they need to pass down e.g. HASH_WRITE_OBJECT. We just
> do it slightly earlier.
>
> And because of that in the general case it's really not the same, I
> think it's a better approach. You've already got the bug in yours of
> needlessly setting up the bulk checkin for !HASH_WRITE_OBJECT in
> update-index, which this neatly solves by deferring the "bulk" mechanism
> until the codepath that's past that and into the "real" object writing.
>
> We can also die() or error out in the object writing before ever getting
> to writing the object, in which case we'd do some setup that we'd need
> to tear down again, by deferring it until the last moment...
>

I'll be submitting a new version to the list which sets up the tmp
objdir lazily on first actual write, so the concern about writing to
the ODB needlessly should go away.

> > And I think there will be a lot of
> > complexity in supporting the same hints for command-line additions
> > (which is roughly equivalent to the git-add workflow).
>
> I left that out due to Junio's comment in
> https://lore.kernel.org/git/xmqqzgljyz34.fsf@gitster.g/; i.e. I don't
> see why we'd find it worthwhile to optimize that case, but we easily
> could (especially per the "just sync the INDEX.tmp" above).
>
> But even if we don't do "THREE" above I think it's still easy; for "TWO"
> we already have a parse_options() state machine to parse argv as it
> comes in. Doing the fsync() on the last object is just a matter of
> "looking ahead" there.
>
> > Every caller
> > that wants batch treatment will have to either implement a state
> > machine or implement a buffering mechanism in order to figure out the
> > begin-end points. Having a separate plug/unplug call eliminates this
> > complexity on each caller.
>
> This is subjective, but I really think that's rather easy to do, and
> much easier to reason about than the global state on the side via
> singletons that your method of avoiding modifying these callers and
> instead having them all consult global state via bulk-checkin.c and
> cache.h demands.

The nice thing about having the ODB handle the batch stuff internally
is that it can present a nice minimal interface to all of the callers.
Yes, it has a complex implementation internally, but that complexity
backs a rather simple API surface:
1. Begin/end transaction (plug/unplug checkin).
2. Find-object by SHA
3. Add object if it doesn't exist
4. Get the SHA without adding anything.

The ODB work is implemented once and callers can easily adopt the
transaction API without having to implement their own stuff on the
side.  Future series can make the transaction span nicely across the
ODB, index, and refs.
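A rough illustration of that four-call surface, sketched here against a toy in-memory table rather than Git's real object store (all names are hypothetical, not actual Git API):

```c
#include <string.h>

/*
 * Toy sketch of the proposed minimal ODB surface: begin/end
 * transaction plus find/add by object id. A real implementation
 * would write loose objects; this just records ids in a table.
 */
#define MAX_OBJS 64

static const char *objs[MAX_OBJS];
static int nr_objs;
static int in_transaction;

void odb_begin_transaction(void) { in_transaction = 1; }

/*
 * On "end", the real thing would issue the single batched
 * hardware flush before making the new objects visible.
 */
void odb_end_transaction(void) { in_transaction = 0; }

/* Find-object by id. */
int odb_has_object(const char *id)
{
	for (int i = 0; i < nr_objs; i++)
		if (!strcmp(objs[i], id))
			return 1;
	return 0;
}

/* Add object if it doesn't exist; returns 1 if newly added. */
int odb_add_object(const char *id)
{
	if (odb_has_object(id) || nr_objs == MAX_OBJS)
		return 0;
	objs[nr_objs++] = id;
	return 1;
}
```

The point of the sketch is that callers only ever see begin/end plus per-object calls; no caller tracks "first" or "last" objects itself.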

> That API also currently assumes single-threaded writers, if we start
> writing some of this in parallel in e.g. "unpack-objects" we'd need
> mutexes in bulk-object.[ch]. Isn't that a lot easier when the caller
> would instead know something about the special nature of the transaction
> they're interacting with, and that the 1st and last item are important
> (for a "BEGIN" and "FLUSH").
>

The API as sketched above doesn't deeply assume single-threadedness
for the "find object by SHA" or "add object if it doesn't exist".
There is a single-threaded assumption for begin/end-transaction.  The
implementation can use pthread_once to handle anything that needs to
be done lazily when adding objects.
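The pthread_once idea mentioned above can be sketched as follows; the setup function stands in for whatever lazy per-transaction work (e.g. creating the tmp objdir) the real code would do, and all names here are illustrative:

```c
#include <pthread.h>

/*
 * Lazy one-time setup on the first object write, safe even with
 * concurrent writers: pthread_once guarantees lazy_setup runs
 * exactly once no matter how many threads race into it.
 */
static pthread_once_t setup_once = PTHREAD_ONCE_INIT;
static int setup_calls;

static void lazy_setup(void)
{
	setup_calls++; /* stand-in for e.g. tmp_objdir_create() */
}

static void *writer(void *arg)
{
	(void)arg;
	pthread_once(&setup_once, lazy_setup); /* first writer wins */
	return NULL;
}

/* Spawn n concurrent writers; returns how often setup ran. */
int run_writers(int n)
{
	pthread_t t[16];

	if (n > 16)
		n = 16;
	for (int i = 0; i < n; i++)
		pthread_create(&t[i], NULL, writer, NULL);
	for (int i = 0; i < n; i++)
		pthread_join(t[i], NULL);
	return setup_calls;
}
```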

> > Btw, I'm planning in a future series to reduce the system calls
> > involved in renaming a file by taking advantage of the renameat2
> > system call and equivalents on other platforms.  There's a pretty
> > strong motivation to do that on Windows.
>
> What do you have in mind for renameat2() specifically?  I.e. which of
> the 3x flags it implements will benefit us? RENAME_NOREPLACE to "move"
> the tmp_OBJ to an eventual OBJ?
>

Yes, RENAME_NOREPLACE.  I'd want to introduce a helper called
git_rename_noreplace and use it instead of the link dance.
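A sketch of what such a helper could look like (the name and fallback policy are assumptions, not Git's actual implementation): try the atomic RENAME_NOREPLACE rename, and fall back to the existing link()+unlink() dance where the syscall is unavailable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/syscall.h>
#endif

/*
 * Hypothetical git_rename_noreplace(): atomically rename src to
 * dst, failing with EEXIST if dst already exists. Falls back to
 * link()+unlink() when renameat2 is not supported.
 */
int git_rename_noreplace(const char *src, const char *dst)
{
#if defined(__linux__) && defined(SYS_renameat2)
	if (!syscall(SYS_renameat2, AT_FDCWD, src, AT_FDCWD, dst,
		     1 /* RENAME_NOREPLACE */))
		return 0;
	/* Only fall through when the syscall/flag is unsupported. */
	if (errno != EINVAL && errno != ENOSYS && errno != ENOTSUP)
		return -1;
#endif
	if (link(src, dst)) /* fails with EEXIST if dst exists */
		return -1;
	return unlink(src);
}
```

On Windows the motivation is stronger still, since the link dance costs extra metadata operations that a single atomic rename avoids.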

> Generally: There's some low-hanging fruit there. E.g. for tmp-objdir we
> slavishly go through the motions of writing a tmp_OBJ, writing (and
> possibly syncing it), then renaming that tmp_OBJ to OBJ.
>
> We could clearly just avoid that in some/all cases that use
> tmp-objdir. I.e. we're writing to a temporary store anyway, so why the
> tmp_OBJ files? We could just write to the final destinations instead,
> they're not reachable (by ref or OID lookup) from anyone else yet.
>

We were thinking before that there could be some concurrency in the
tmp_objdir, though I personally don't believe it's possible for the
typical bulk checkin case.  Using the final name in the tmp objdir
would be a nice optimization, but I think that it's a separable
concern that shouldn't block the bigger win from eliminating the cache
flushes.

> But even then I don't see how you'd get away with reducing some classes
> of syscalls past the 2x increase for some (leading to an overall increase,
> but not a ~2x overall increase as noted in:
> https://lore.kernel.org/git/RFC-patch-7.7-481f1d771cb-20220323T033928Z-avarab@gmail.com/)
> as long as you use the tmp-objdir API. It's always going to have to
> write tmpdir/OBJ and link()/rename() that to OBJ.
>
> Now, I do think there's an easy way by extending the API use I've
> introduced in this RFC to do it. I.e. we'd just do:
>
>     ## METHOD FOUR
>     # A -- SAME as THREE, except no rename()
>     write(objects/A.tmp)
>     bulk_fsync(objects/A.tmp)
>     # B -- SAME as THREE, except no rename()
>     write(objects/B.tmp)
>     bulk_fsync(objects/B.tmp)
>     # ref -- SAME
>     write(INDEX.tmp, $(git rev-parse B))
>     fsync(INDEX.tmp)
>     # NEW: do all the renames at the end:
>     rename(objects/A.tmp, objects/A)
>     rename(objects/B.tmp, objects/B)
>     rename(INDEX.tmp, INDEX)
>
> That seems like an obvious win to me in any case. I.e. the tmp-objdir
> API isn't really a close fit for what we *really* want to do in this
> case.
>

I think this is the right place to get to eventually.  I believe the
best way to get there is to keep the plug/unplug bulk checkin
functionality (rebranding it as an 'ODB transaction') and then make
that a sub-transaction of a larger 'git repo transaction.'

> I.e. the reason it does everything this way is because it was explicitly
> designed for 722ff7f876c (receive-pack: quarantine objects until
> pre-receive accepts, 2016-10-03), where it's the right trade-off,
> because we'd like to cheaply "rm -rf" the whole thing if e.g. the
> "pre-receive" hook rejects the push.
>
> *AND* because it's made for the case of other things concurrently
> needing access to those objects. So pedantically you would need it for
> some modes of "git update-index", but not e.g. "git unpack-objects"
> where we really are expecting to keep all of them.
>
> > Thanks for the concrete code,
>
> ..but no thanks? I.e. it would be useful to explicitly know if you're
> interested or open to running with some of the approach in this RFC.

I'm still at the point of arguing with you about your RFC, but I'm
_not_ currently leaning toward adopting your approach.  I think from a
separation-of-concerns perspective, we shouldn't change top-level git
commands to try hard to track first/last object.  The ODB should
conceptually handle it internally as part of a higher-level
transaction.  Consider cmd_add, which does its interesting
add_file_to_index from the update_callback coming from the diff code:
I believe it would be hopelessly complex/impossible to do the tracking
required to pass the LAST_OF_N flag to a multiplexed write API.

We have a pretty clear example from the database world that
begin/end-transaction is the right way to design the API for the task
we want to accomplish.  It's also how many filesystems work
internally.  I don't want to reinvent the wheel here.

Thanks,
Neeraj

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 23, 2022 at 7:18 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> In preparation for making the bulk-checkin.c logic operate from
> object-file.c itself in some common cases let's add
> HASH_N_OBJECTS{,_{FIRST,LAST}} flags.
>
> This will allow us to adjust for-loops that add N objects to just pass
> down whether they have >1 objects (HASH_N_OBJECTS), as well as passing
> down flags for whether we have the first or last object.
>
> We'll thus be able to drive any sort of batch-object mechanism from
> write_object_file_flags() directly, which until now didn't know if it
> was doing one object, or some arbitrary N.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  builtin/unpack-objects.c | 60 +++++++++++++++++++++++-----------------
>  cache.h                  |  3 ++
>  2 files changed, 37 insertions(+), 26 deletions(-)
>
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index c55b6616aed..ec40c6fd966 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -233,7 +233,8 @@ static void write_rest(void)
>  }
>
>  static void added_object(unsigned nr, enum object_type type,
> -                        void *data, unsigned long size);
> +                        void *data, unsigned long size,
> +                        unsigned oflags);
>
>  /*
>   * Write out nr-th object from the list, now we know the contents
> @@ -241,21 +242,21 @@ static void added_object(unsigned nr, enum object_type type,
>   * to be checked at the end.
>   */
>  static void write_object(unsigned nr, enum object_type type,
> -                        void *buf, unsigned long size)
> +                        void *buf, unsigned long size, unsigned oflags)
>  {
>         if (!strict) {
> -               if (write_object_file(buf, size, type,
> -                                     &obj_list[nr].oid) < 0)
> +               if (write_object_file_flags(buf, size, type,
> +                                     &obj_list[nr].oid, oflags) < 0)
>                         die("failed to write object");
> -               added_object(nr, type, buf, size);
> +               added_object(nr, type, buf, size, oflags);
>                 free(buf);
>                 obj_list[nr].obj = NULL;
>         } else if (type == OBJ_BLOB) {
>                 struct blob *blob;
> -               if (write_object_file(buf, size, type,
> -                                     &obj_list[nr].oid) < 0)
> +               if (write_object_file_flags(buf, size, type,
> +                                           &obj_list[nr].oid, oflags) < 0)
>                         die("failed to write object");
> -               added_object(nr, type, buf, size);
> +               added_object(nr, type, buf, size, oflags);
>                 free(buf);
>
>                 blob = lookup_blob(the_repository, &obj_list[nr].oid);
> @@ -269,7 +270,7 @@ static void write_object(unsigned nr, enum object_type type,
>                 int eaten;
>                 hash_object_file(the_hash_algo, buf, size, type,
>                                  &obj_list[nr].oid);
> -               added_object(nr, type, buf, size);
> +               added_object(nr, type, buf, size, oflags);
>                 obj = parse_object_buffer(the_repository, &obj_list[nr].oid,
>                                           type, size, buf,
>                                           &eaten);
> @@ -283,7 +284,7 @@ static void write_object(unsigned nr, enum object_type type,
>
>  static void resolve_delta(unsigned nr, enum object_type type,
>                           void *base, unsigned long base_size,
> -                         void *delta, unsigned long delta_size)
> +                         void *delta, unsigned long delta_size, unsigned oflags)
>  {
>         void *result;
>         unsigned long result_size;
> @@ -294,7 +295,7 @@ static void resolve_delta(unsigned nr, enum object_type type,
>         if (!result)
>                 die("failed to apply delta");
>         free(delta);
> -       write_object(nr, type, result, result_size);
> +       write_object(nr, type, result, result_size, oflags);
>  }
>
>  /*
> @@ -302,7 +303,7 @@ static void resolve_delta(unsigned nr, enum object_type type,
>   * resolve all the deltified objects that are based on it.
>   */
>  static void added_object(unsigned nr, enum object_type type,
> -                        void *data, unsigned long size)
> +                        void *data, unsigned long size, unsigned oflags)
>  {
>         struct delta_info **p = &delta_list;
>         struct delta_info *info;
> @@ -313,7 +314,7 @@ static void added_object(unsigned nr, enum object_type type,
>                         *p = info->next;
>                         p = &delta_list;
>                         resolve_delta(info->nr, type, data, size,
> -                                     info->delta, info->size);
> +                                     info->delta, info->size, oflags);
>                         free(info);
>                         continue;
>                 }
> @@ -322,18 +323,19 @@ static void added_object(unsigned nr, enum object_type type,
>  }
>
>  static void unpack_non_delta_entry(enum object_type type, unsigned long size,
> -                                  unsigned nr)
> +                                  unsigned nr, unsigned oflags)
>  {
>         void *buf = get_data(size);
>
>         if (!dry_run && buf)
> -               write_object(nr, type, buf, size);
> +               write_object(nr, type, buf, size, oflags);
>         else
>                 free(buf);
>  }
>
>  static int resolve_against_held(unsigned nr, const struct object_id *base,
> -                               void *delta_data, unsigned long delta_size)
> +                               void *delta_data, unsigned long delta_size,
> +                               unsigned oflags)
>  {
>         struct object *obj;
>         struct obj_buffer *obj_buffer;
> @@ -344,12 +346,12 @@ static int resolve_against_held(unsigned nr, const struct object_id *base,
>         if (!obj_buffer)
>                 return 0;
>         resolve_delta(nr, obj->type, obj_buffer->buffer,
> -                     obj_buffer->size, delta_data, delta_size);
> +                     obj_buffer->size, delta_data, delta_size, oflags);
>         return 1;
>  }
>
>  static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
> -                              unsigned nr)
> +                              unsigned nr, unsigned oflags)
>  {
>         void *delta_data, *base;
>         unsigned long base_size;
> @@ -366,7 +368,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
>                 if (has_object_file(&base_oid))
>                         ; /* Ok we have this one */
>                 else if (resolve_against_held(nr, &base_oid,
> -                                             delta_data, delta_size))
> +                                             delta_data, delta_size, oflags))
>                         return; /* we are done */
>                 else {
>                         /* cannot resolve yet --- queue it */
> @@ -428,7 +430,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
>                 }
>         }
>
> -       if (resolve_against_held(nr, &base_oid, delta_data, delta_size))
> +       if (resolve_against_held(nr, &base_oid, delta_data, delta_size, oflags))
>                 return;
>
>         base = read_object_file(&base_oid, &type, &base_size);
> @@ -440,11 +442,11 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
>                 has_errors = 1;
>                 return;
>         }
> -       resolve_delta(nr, type, base, base_size, delta_data, delta_size);
> +       resolve_delta(nr, type, base, base_size, delta_data, delta_size, oflags);
>         free(base);
>  }
>
> -static void unpack_one(unsigned nr)
> +static void unpack_one(unsigned nr, unsigned oflags)
>  {
>         unsigned shift;
>         unsigned char *pack;
> @@ -472,11 +474,11 @@ static void unpack_one(unsigned nr)
>         case OBJ_TREE:
>         case OBJ_BLOB:
>         case OBJ_TAG:
> -               unpack_non_delta_entry(type, size, nr);
> +               unpack_non_delta_entry(type, size, nr, oflags);
>                 return;
>         case OBJ_REF_DELTA:
>         case OBJ_OFS_DELTA:
> -               unpack_delta_entry(type, size, nr);
> +               unpack_delta_entry(type, size, nr, oflags);
>                 return;
>         default:
>                 error("bad object type %d", type);
> @@ -491,6 +493,7 @@ static void unpack_all(void)
>  {
>         int i;
>         struct pack_header *hdr = fill(sizeof(struct pack_header));
> +       unsigned oflags;
>
>         nr_objects = ntohl(hdr->hdr_entries);
>
> @@ -505,9 +508,14 @@ static void unpack_all(void)
>                 progress = start_progress(_("Unpacking objects"), nr_objects);
>         CALLOC_ARRAY(obj_list, nr_objects);
>         plug_bulk_checkin();
> +       oflags = nr_objects > 1 ? HASH_N_OBJECTS : 0;
>         for (i = 0; i < nr_objects; i++) {
> -               unpack_one(i);
> -               display_progress(progress, i + 1);
> +               int nth = i + 1;
> +               unsigned f = i == 0 ? HASH_N_OBJECTS_FIRST :
> +                       nr_objects == nth ? HASH_N_OBJECTS_LAST : 0;
> +
> +               unpack_one(i, oflags | f);
> +               display_progress(progress, nth);
>         }
>         unplug_bulk_checkin();
>         stop_progress(&progress);
> diff --git a/cache.h b/cache.h
> index 84fafe2ed71..72c91c91286 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -896,6 +896,9 @@ int ie_modified(struct index_state *, const struct cache_entry *, struct stat *,
>  #define HASH_FORMAT_CHECK 2
>  #define HASH_RENORMALIZE  4
>  #define HASH_SILENT 8
> +#define HASH_N_OBJECTS 1<<4
> +#define HASH_N_OBJECTS_FIRST 1<<5
> +#define HASH_N_OBJECTS_LAST 1<<6
>  int index_fd(struct index_state *istate, struct object_id *oid, int fd, struct stat *st, enum object_type type, const char *path, unsigned flags);
>  int index_path(struct index_state *istate, struct object_id *oid, const char *path, struct stat *st, unsigned flags);
>
> --
> 2.35.1.1428.g1c1a0152d61
>

This patch works out okay because unpack-objects is the easy case.
You have a well defined number of objects.  I'd be fine with your
design if all cases were like this.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 23, 2022 at 7:18 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Remove much of this as a POC for exploring some of what I mentioned in
> https://lore.kernel.org/git/220322.86mthinxnn.gmgdl@evledraar.gmail.com/
>
> This commit is obviously not what we *should* do as end-state, but
> demonstrates what's needed (I think) for a bare-minimum implementation
> of just the "bulk" syncing method for loose objects without the part
> where we do the tmp-objdir.c dance.
>
> Performance with this is already quite promising. Benchmarking with:
>
>         git hyperfine -L rev ns/batched-fsync,HEAD -s 'make CFLAGS=-O3' \
>                 -p 'rm -rf r.git && git init --bare r.git' \
>                 './git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack'
>
> I.e. unpacking a small packfile (my dotfiles) yields, on a Linux
> ramdisk:
>
>         Benchmark 1: ./git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack' in 'ns/batched-fsync
>           Time (mean ± σ):     815.9 ms ±   8.2 ms    [User: 522.9 ms, System: 287.9 ms]
>           Range (min … max):   805.6 ms … 835.9 ms    10 runs
>
>         Benchmark 2: ./git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack' in 'HEAD
>           Time (mean ± σ):     779.4 ms ±  15.4 ms    [User: 505.7 ms, System: 270.2 ms]
>           Range (min … max):   763.1 ms … 813.9 ms    10 runs
>
>         Summary
>           './git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack' in 'HEAD' ran
>             1.05 ± 0.02 times faster than './git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack' in 'ns/batched-fsync'
>
> Doing the same with "strace --summary-only", which probably helps to
> emulate cases with slower syscalls is ~15% faster than using the
> tmp-objdir indirection:
>
>         Summary
>           'strace --summary-only ./git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack' in 'HEAD' ran
>             1.16 ± 0.01 times faster than 'strace --summary-only ./git -C r.git -c core.fsync=loose-object -c core.fsyncMethod=batch unpack-objects </tmp/pack-dotfiles.pack' in 'ns/batched-fsync'
>
> Which makes sense in terms of syscalls. In my case HEAD has ~101k
> calls, and the parent topic is making ~129k calls, with around 2x the
> number of unlink(), link() as expected.
>
> Of course some users will want to use the tmp-objdir.c method. So a
> version of this commit could be rewritten to come earlier in the
> series, with the "bulk" on top being optional.
>
> It seems to me that it's a much better strategy to do this whole thing
> in close_loose_object() after passing down the new HASH_N_OBJECTS /
> HASH_N_OBJECTS_FIRST / HASH_N_OBJECTS_LAST flags.
>
> Doing that for the "builtin/add.c" and "builtin/unpack-objects.c" code
> having its {un,}plug_bulk_checkin() removed here is then just a matter
> of passing down a similar set of flags indicating whether we're
> dealing with N objects, and if so if we're dealing with the last one
> or not.
>
> As we'll see in subsequent commits doing it this way also effortlessly
> integrates with other HASH_* flags. E.g. for "update-index" the code
> being rm'd here doesn't handle the interaction with
> "HASH_WRITE_OBJECT" properly, but once we've moved all this sync
> bootstrapping logic to close_loose_object() we'll never get to it if
> we're not actually writing something.
>
> This code currently doesn't use the HASH_N_OBJECTS_FIRST flag, but
> that's what we'd use later to optionally call tmp_objdir_create().
>
> Aside: This also changes logic that was a bit confusing and repetitive
> in close_loose_object(). Previously we'd first call
> batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT) which is just as
> shorthand for:
>
>         fsync_components & FSYNC_COMPONENT_LOOSE_OBJECT &&
>         fsync_method == FSYNC_METHOD_BATCH
>
> We'd then proceed to call
> fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT) later in the same
> function, which is just a way of calling fsync_or_die() if:
>
>         fsync_components & FSYNC_COMPONENT_LOOSE_OBJECT
>
> Now we instead just define a local "fsync_loose" variable by checking
> "fsync_components & FSYNC_COMPONENT_LOOSE_OBJECT", which shows us that
> the previous case of fsync_component_or_die(...)" could just be added
> to the existing "fsync_object_files > 0" branch.
>
> Note: This commit reverts much of "core.fsyncmethod: batched disk
> flushes for loose-objects". We'll set up new structures to bring what
> it was doing back in a different way. I.e. to do the tmp-objdir
> plug-in in object-file.c
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  builtin/unpack-objects.c |  2 --
>  builtin/update-index.c   |  4 ---
>  bulk-checkin.c           | 74 ----------------------------------------
>  bulk-checkin.h           |  3 --
>  cache.h                  |  5 ---
>  object-file.c            | 37 ++++++++++++++------
>  6 files changed, 26 insertions(+), 99 deletions(-)
>
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index ec40c6fd966..93da436581b 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -507,7 +507,6 @@ static void unpack_all(void)
>         if (!quiet)
>                 progress = start_progress(_("Unpacking objects"), nr_objects);
>         CALLOC_ARRAY(obj_list, nr_objects);
> -       plug_bulk_checkin();
>         oflags = nr_objects > 1 ? HASH_N_OBJECTS : 0;
>         for (i = 0; i < nr_objects; i++) {
>                 int nth = i + 1;
> @@ -517,7 +516,6 @@ static void unpack_all(void)
>                 unpack_one(i, oflags | f);
>                 display_progress(progress, nth);
>         }
> -       unplug_bulk_checkin();
>         stop_progress(&progress);
>
>         if (delta_list)
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index cbd2b0d633b..95ed3c47b2e 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -1118,8 +1118,6 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>         parse_options_start(&ctx, argc, argv, prefix,
>                             options, PARSE_OPT_STOP_AT_NON_OPTION);
>
> -       /* optimize adding many objects to the object database */
> -       plug_bulk_checkin();
>         while (ctx.argc) {
>                 if (parseopt_state != PARSE_OPT_DONE)
>                         parseopt_state = parse_options_step(&ctx, options,
> @@ -1194,8 +1192,6 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>                 strbuf_release(&buf);
>         }
>
> -       /* by now we must have added all of the new objects */
> -       unplug_bulk_checkin();
>         if (split_index > 0) {
>                 if (git_config_get_split_index() == 0)
>                         warning(_("core.splitIndex is set to false; "
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index a0dca79ba6a..577b135e39c 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -3,20 +3,15 @@
>   */
>  #include "cache.h"
>  #include "bulk-checkin.h"
> -#include "lockfile.h"
>  #include "repository.h"
>  #include "csum-file.h"
>  #include "pack.h"
>  #include "strbuf.h"
> -#include "string-list.h"
> -#include "tmp-objdir.h"
>  #include "packfile.h"
>  #include "object-store.h"
>
>  static int bulk_checkin_plugged;
>
> -static struct tmp_objdir *bulk_fsync_objdir;
> -
>  static struct bulk_checkin_state {
>         char *pack_tmp_name;
>         struct hashfile *f;
> @@ -85,40 +80,6 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state)
>         reprepare_packed_git(the_repository);
>  }
>
> -/*
> - * Cleanup after batch-mode fsync_object_files.
> - */
> -static void do_batch_fsync(void)
> -{
> -       struct strbuf temp_path = STRBUF_INIT;
> -       struct tempfile *temp;
> -
> -       if (!bulk_fsync_objdir)
> -               return;
> -
> -       /*
> -        * Issue a full hardware flush against a temporary file to ensure
> -        * that all objects are durable before any renames occur. The code in
> -        * fsync_loose_object_bulk_checkin has already issued a writeout
> -        * request, but it has not flushed any writeback cache in the storage
> -        * hardware or any filesystem logs. This fsync call acts as a barrier
> -        * to ensure that the data in each new object file is durable before
> -        * the final name is visible.
> -        */
> -       strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> -       temp = xmks_tempfile(temp_path.buf);
> -       fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> -       delete_tempfile(&temp);
> -       strbuf_release(&temp_path);
> -
> -       /*
> -        * Make the object files visible in the primary ODB after their data is
> -        * fully durable.
> -        */
> -       tmp_objdir_migrate(bulk_fsync_objdir);
> -       bulk_fsync_objdir = NULL;
> -}
> -
>  static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
>  {
>         int i;
> @@ -313,26 +274,6 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
>         return 0;
>  }
>
> -void prepare_loose_object_bulk_checkin(void)
> -{
> -       if (bulk_checkin_plugged && !bulk_fsync_objdir)
> -               bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> -}
> -
> -void fsync_loose_object_bulk_checkin(int fd, const char *filename)
> -{
> -       /*
> -        * If we have a plugged bulk checkin, we issue a call that
> -        * cleans the filesystem page cache but avoids a hardware flush
> -        * command. Later on we will issue a single hardware flush
> -        * before as part of do_batch_fsync.
> -        */
> -       if (!bulk_fsync_objdir ||
> -           git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0) {
> -               fsync_or_die(fd, filename);
> -       }
> -}
> -
>  int index_bulk_checkin(struct object_id *oid,
>                        int fd, size_t size, enum object_type type,
>                        const char *path, unsigned flags)
> @@ -347,19 +288,6 @@ int index_bulk_checkin(struct object_id *oid,
>  void plug_bulk_checkin(void)
>  {
>         assert(!bulk_checkin_plugged);
> -
> -       /*
> -        * A temporary object directory is used to hold the files
> -        * while they are not fsynced.
> -        */
> -       if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> -               bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> -               if (!bulk_fsync_objdir)
> -                       die(_("Could not create temporary object directory for core.fsyncMethod=batch"));
> -
> -               tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> -       }
> -
>         bulk_checkin_plugged = 1;
>  }
>
> @@ -369,6 +297,4 @@ void unplug_bulk_checkin(void)
>         bulk_checkin_plugged = 0;
>         if (bulk_checkin_state.f)
>                 finish_bulk_checkin(&bulk_checkin_state);
> -
> -       do_batch_fsync();
>  }
> diff --git a/bulk-checkin.h b/bulk-checkin.h
> index 181d3447ff9..b26f3dc3b74 100644
> --- a/bulk-checkin.h
> +++ b/bulk-checkin.h
> @@ -6,9 +6,6 @@
>
>  #include "cache.h"
>
> -void prepare_loose_object_bulk_checkin(void);
> -void fsync_loose_object_bulk_checkin(int fd, const char *filename);
> -
>  int index_bulk_checkin(struct object_id *oid,
>                        int fd, size_t size, enum object_type type,
>                        const char *path, unsigned flags);
> diff --git a/cache.h b/cache.h
> index 72c91c91286..2f3831fa853 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1772,11 +1772,6 @@ void fsync_or_die(int fd, const char *);
>  int fsync_component(enum fsync_component component, int fd);
>  void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
>
> -static inline int batch_fsync_enabled(enum fsync_component component)
> -{
> -       return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
> -}
> -
>  ssize_t read_in_full(int fd, void *buf, size_t count);
>  ssize_t write_in_full(int fd, const void *buf, size_t count);
>  ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
> diff --git a/object-file.c b/object-file.c
> index cd0ddb49e4b..dbeb3df502d 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1886,19 +1886,37 @@ void hash_object_file(const struct git_hash_algo *algo, const void *buf,
>         hash_object_file_literally(algo, buf, len, type_name(type), oid);
>  }
>
> +static void sync_loose_object_batch(int fd, const char *filename,
> +                                   const unsigned oflags)
> +{
> +       const int last = oflags & HASH_N_OBJECTS_LAST;
> +
> +       /*
> +        * We're doing a sync_file_range() (or equivalent) for 1..N-1
> +        * objects, and then a "real" fsync() for N. On some OS's
> +        * enabling core.fsync=loose-object && core.fsyncMethod=batch
> +        * improves the performance by a lot.
> +        */
> +       if (last || (!last && git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0))
> +               fsync_or_die(fd, filename);
> +}
> +
>  /* Finalize a file on disk, and close it. */
> -static void close_loose_object(int fd, const char *filename)
> +static void close_loose_object(int fd, const char *filename,
> +                              const unsigned oflags)
>  {
> +       int fsync_loose;
> +
>         if (the_repository->objects->odb->will_destroy)
>                 goto out;
>
> -       if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> -               fsync_loose_object_bulk_checkin(fd, filename);
> -       else if (fsync_object_files > 0)
> +       fsync_loose = fsync_components & FSYNC_COMPONENT_LOOSE_OBJECT;
> +
> +       if (oflags & HASH_N_OBJECTS && fsync_loose &&
> +           fsync_method == FSYNC_METHOD_BATCH)
> +               sync_loose_object_batch(fd, filename, oflags);
> +       else if (fsync_object_files > 0 || fsync_loose)
>                 fsync_or_die(fd, filename);
> -       else
> -               fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
> -                                      filename);
>
>  out:
>         if (close(fd) != 0)
> @@ -1962,9 +1980,6 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
>         static struct strbuf tmp_file = STRBUF_INIT;
>         static struct strbuf filename = STRBUF_INIT;
>
> -       if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> -               prepare_loose_object_bulk_checkin();
> -
>         loose_object_path(the_repository, &filename, oid);
>
>         fd = create_tmpfile(&tmp_file, filename.buf);
> @@ -2015,7 +2030,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
>                 die(_("confused by unstable object source data for %s"),
>                     oid_to_hex(oid));
>
> -       close_loose_object(fd, tmp_file.buf);
> +       close_loose_object(fd, tmp_file.buf, flags);
>
>         if (mtime) {
>                 struct utimbuf utb;
> --
> 2.35.1.1428.g1c1a0152d61
>

Fine. Doing this patch series as non-RFC, we could start from prior to
my fsyncMethod=batch series.
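The decision in the quoted close_loose_object() hunk (batch vs. per-object fsync vs. nothing) can be modeled outside of C. The sketch below is hypothetical Python, not git code; the flag and constant names mirror the patch, everything else is invented for illustration:

```python
# Hypothetical model of the close_loose_object() decision in the hunk above.
# Flag names mirror the patch; values and the function itself are invented.

HASH_N_OBJECTS = 1 << 0               # "this write is one of a batch of N objects"
FSYNC_COMPONENT_LOOSE_OBJECT = 1 << 1

def choose_sync_action(oflags, fsync_components, fsync_method, fsync_object_files=0):
    """Return which sync path the patch would take for one loose object."""
    fsync_loose = fsync_components & FSYNC_COMPONENT_LOOSE_OBJECT
    if (oflags & HASH_N_OBJECTS) and fsync_loose and fsync_method == "batch":
        return "batch"    # writeout-only now, one real hardware flush later
    if fsync_object_files > 0 or fsync_loose:
        return "fsync"    # full fsync_or_die() on this object file
    return "none"         # no durability requested
```

Note how the legacy `fsync_object_files` knob still forces a per-object fsync even when the component bit is unset, matching the `fsync_object_files > 0 || fsync_loose` condition in the diff.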

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 23, 2022 at 7:18 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> As with unpack-objects in a preceding commit have update-index.c make
> use of the HASH_N_OBJECTS{,_{FIRST,LAST}} flags. We now have a "batch"
> mode again for "update-index".
>
> Adding the t/* directory from git.git on a Linux ramdisk is a bit
> faster than with the tmp-objdir indirection:
>
>         $ git hyperfine -L rev ns/batched-fsync,HEAD -s 'make CFLAGS=-O3 && rm -rf repo && git init repo && cp -R t repo/ && git ls-files -- t >repo/.git/to-add.txt' -p 'rm -rf repo/.git/objects/* repo/.git/index' './git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' --warmup 1 -r 10
>
>         Benchmark 1: ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' in 'ns/batched-fsync
>           Time (mean ± σ):     281.1 ms ±   2.6 ms    [User: 186.2 ms, System: 92.3 ms]
>           Range (min … max):   278.3 ms … 287.0 ms    10 runs
>
>         Benchmark 2: ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' in 'HEAD
>           Time (mean ± σ):     265.9 ms ±   2.6 ms    [User: 181.7 ms, System: 82.1 ms]
>           Range (min … max):   262.0 ms … 270.3 ms    10 runs
>
>         Summary
>           './git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' in 'HEAD' ran
>             1.06 ± 0.01 times faster than './git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' in 'ns/batched-fsync'
>
> And as before running that with "strace --summary-only" slows things
> down a bit (probably mimicking slower I/O a bit). I then get:
>
>         Summary
>           'strace --summary-only ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' in 'HEAD' ran
>             1.19 ± 0.03 times faster than 'strace --summary-only ./git -c core.fsync=loose-object -c core.fsyncMethod=batch -C repo update-index --add --stdin <repo/.git/to-add.txt' in 'ns/batched-fsync'
>
> This one has a twist though, instead of fsync()-ing on the last object
> we write we'll not do that, and instead defer the fsync() until we
> write the index itself. This is outlined in [1] (as "METHOD THREE").
>
> Because of this under FSYNC_METHOD_BATCH we'll do the N
> objects (possibly only one, because we're lazy) as HASH_N_OBJECTS, and
> we'll even now support doing this via N arguments on the command-line.
>
> Then we won't fsync() any of it, but we will rename it
> in-place (which, if we were still using the tmp-objdir, would leave it
> "staged" in the tmp-objdir).
>
> We'll then have the fsync() for the index update "flush" that out, and
> thus avoid two fsync() calls when one will do.
>
> Running this with the "git hyperfine" command mentioned in a preceding
> commit with "strace --summary-only" shows that we do 1 fsync() now
> instead of 2, and have one more sync_file_range(), as expected.
>
> We also go from ~51k syscalls to ~39k, with ~2x the number of link()
> and unlink() in ns/batched-fsync, and of course one fsync() instead of
> two.
>
> The flow of this code isn't quite set up for re-plugging the
> tmp-objdir back in. In particular we no longer pass
> HASH_N_OBJECTS_FIRST (but doing so would be trivial), and there's no
> HASH_N_OBJECTS_LAST.
>
> So this and other callers would need some light transaction-y API, or
> to otherwise pass down a "yes, I'd like to flush it" down to
> finalize_hashfile(), but doing so will be trivial.
>
> And since we've started structuring it this way it'll become easy to
> do any arbitrary number of things down the line that would "bulk
> fsync" before the final fsync(). Now we write some objects and fsync()
> on the index, but between those two could do any number of other
> things where we'd defer the fsync().
>
> This sort of thing might be especially interesting for "git repack"
> when it writes e.g. a *.bitmap, *.rev, *.pack and *.idx. In that case
> we could skip the fsync() on all of those, and only do it on the *.idx
> before we renamed it in-place. I *think* nothing cares about a *.pack
> without an *.idx, but even then we could fsync *.idx, rename *.pack,
> rename *.idx and still safely do only one fsync(). See "git show
> --first-parent" on 62874602032 (Merge branch
> 'tb/pack-finalize-ordering' into maint, 2021-10-12) for a good
> overview of the code involved in that.
>
> 1. https://lore.kernel.org/git/220323.86sfr9ndpr.gmgdl@evledraar.gmail.com/
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  builtin/update-index.c |  7 ++++---
>  cache.h                |  1 +
>  read-cache.c           | 29 ++++++++++++++++++++++++++++-
>  3 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 34aaaa16c20..6cfec6efb38 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -1142,7 +1142,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>
>                         setup_work_tree();
>                         p = prefix_path(prefix, prefix_length, path);
> -                       update_one(p, 0);
> +                       update_one(p, HASH_N_OBJECTS);
>                         if (set_executable_bit)
>                                 chmod_path(set_executable_bit, p);
>                         free(p);
> @@ -1187,7 +1187,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>                                 strbuf_swap(&buf, &unquoted);
>                         }
>                         p = prefix_path(prefix, prefix_length, buf.buf);
> -                       update_one(p, 0);
> +                       update_one(p, HASH_N_OBJECTS);
>                         if (set_executable_bit)
>                                 chmod_path(set_executable_bit, p);
>                         free(p);
> @@ -1263,7 +1263,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>                                 exit(128);
>                         unable_to_lock_die(get_index_file(), lock_error);
>                 }
> -               if (write_locked_index(&the_index, &lock_file, COMMIT_LOCK))
> +               if (write_locked_index(&the_index, &lock_file,
> +                                      COMMIT_LOCK | WLI_NEED_LOOSE_FSYNC))
>                         die("Unable to write new index file");
>         }
>
> diff --git a/cache.h b/cache.h
> index 2f3831fa853..7542e009a34 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -751,6 +751,7 @@ void ensure_full_index(struct index_state *istate);
>  /* For use with `write_locked_index()`. */
>  #define COMMIT_LOCK            (1 << 0)
>  #define SKIP_IF_UNCHANGED      (1 << 1)
> +#define WLI_NEED_LOOSE_FSYNC   (1 << 2)
>
>  /*
>   * Write the index while holding an already-taken lock. Close the lock,
> diff --git a/read-cache.c b/read-cache.c
> index 3e0e7d41837..275f6308c32 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -2860,6 +2860,33 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>         int ieot_entries = 1;
>         struct index_entry_offset_table *ieot = NULL;
>         int nr, nr_threads;
> +       unsigned int wflags = FSYNC_COMPONENT_INDEX;
> +
> +
> +       /*
> +        * TODO: This is abuse of the API recently modified
> +        * finalize_hashfile() which reveals a shortcoming of its
> +        * "fsync" design.
> +        *
> +        * I.e. It expects a "enum fsync_component component" label,
> +        * but here we're passing it an OR of the two, knowing that
> +        * it'll call fsync_component_or_die() which (in
> +        * write-or-die.c) will do "(fsync_components & wflags)" (to
> +        * our "wflags" here).
> +        *
> +        * But the API really should be changed to explicitly take
> +        * such flags, because in this case we'd like to fsync() the
> +        * index if we're in the bulk mode, *even if* our
> +        * "core.fsync=index" isn't configured.
> +        *
> +        * That's because at this point we've been queuing up object
> +        * writes that we didn't fsync(), and are going to use this
> +        * fsync() to "flush" the whole thing. Doing it this way
> +        * avoids redundantly calling fsync() twice when once will do.
> +        */
> +       if (fsync_method == FSYNC_METHOD_BATCH &&
> +           flags & WLI_NEED_LOOSE_FSYNC)
> +               wflags |= FSYNC_COMPONENT_LOOSE_OBJECT;
>
>         f = hashfd(tempfile->fd, tempfile->filename.buf);
>
> @@ -3094,7 +3121,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>         if (!alternate_index_output && (flags & COMMIT_LOCK))
>                 csum_fsync_flag = CSUM_FSYNC;
>
> -       finalize_hashfile(f, istate->oid.hash, FSYNC_COMPONENT_INDEX,
> +       finalize_hashfile(f, istate->oid.hash, wflags,
>                           CSUM_HASH_IN_STREAM | csum_fsync_flag);
>
>         if (close_tempfile_gently(tempfile)) {
> --
> 2.35.1.1428.g1c1a0152d61
>

In the long run, we should attach the "need to fsync the index" to an
ongoing 'repo-transaction' so that we can composably sync at the best
point regardless of what the top-level git operation does.
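The "flush on the index write instead of the last object" idea discussed above can be shown concretely as a syscall pattern. The following is not git's code: it is a hedged Python sketch in which plain writes stand in for the writeout-only step (sync_file_range has no portable Python equivalent), and the single os.fsync() rides on the index file that makes the objects reachable:

```python
import os

def write_objects_then_index(objdir, objects, index_path):
    """Write N loose objects with no per-file flush, then make them
    'reachable' by writing the index and issuing the one real fsync there."""
    for name, data in objects.items():
        tmp = os.path.join(objdir, "tmp_obj_" + name)
        with open(tmp, "wb") as f:
            f.write(data)
            # here git would issue a writeout-only request, not a full flush
        os.rename(tmp, os.path.join(objdir, name))  # no fsync per object

    # one hardware flush, attached to the index update that makes the
    # objects above reachable
    fd = os.open(index_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, ("\n".join(objects) + "\n").encode())
        os.fsync(fd)   # covers the index and, on flush-everything disks, the objects
    finally:
        os.close(fd)
```

The safety argument is the same one made in the thread: nothing refers to the objects until the fsync()'d index lands, so a crash before the flush loses only unreferenced files.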


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 23, 2022 at 7:18 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Add a new fsyncMethod.batch.quarantine setting which defaults to
> "false". Preceding (RFC, and not meant to flip-flop like that
> eventually) commits ripped out the "tmp-objdir" part of the
> core.fsyncMethod=batch.
>
> This documentation proposes to keep that as the default for the
> reasons discussed in it, while allowing users to set
> "fsyncMethod.batch.quarantine=true".
>
> Furthermore update the discussion of "core.fsyncObjectFiles" with
> information about what it *really* does, why you probably shouldn't
> use it, and how to safely emulate most of what it gave users in the
> past in terms of performance benefit.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  Documentation/config/core.txt | 80 +++++++++++++++++++++++++++++++----
>  1 file changed, 72 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index f598925b597..365a12dc7ae 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -607,21 +607,85 @@ stored on NTFS or ReFS filesystems.
>  +
>  The `batch` is currently only applies to loose-object files and will
>  kick in when using the linkgit:git-unpack-objects[1] and
> -linkgit:update-index[1] commands. Note that the "last" file to be
> +linkgit:git-update-index[1] commands. Note that the "last" file to be
>  synced may be the last object, as in the case of
>  linkgit:git-unpack-objects[1], or relevant "index" (or in the future,
>  "ref") update, as in the case of linkgit:git-update-index[1]. I.e. the
>  batch syncing of the loose objects may be deferred until a subsequent
>  fsync() to a file that makes them "active".
>
> +fsyncMethod.batch.quarantine::
> +       A boolean which if set to `true` will cause "batched" writes
> +       to objects to be "quarantined" if
> +       `core.fsyncMethod=batch`. This is `false` by default.
> ++
> +The primary object of these fsync() settings is to protect against
> +repository corruption of things which are reachable, i.e. "reachable",
> +via references, the index etc. Not merely objects that were present in
> +the object store.
> ++
> +Historically setting `core.fsyncObjectFiles=false` assumed that on a
> +filesystem with where an fsync() would flush all preceding outstanding
> +I/O that we might end up with a corrupt loose object, but that was OK
> +as long as no reference referred to it. We'd eventually the corrupt
> +object with linkgit:git-gc[1], and linkgit:git-fsck[1] would only
> +report it as a minor annoyance
> ++
> +Setting `fsyncMethod.batch.quarantine=true` takes the view that
> +something like a corrupt *unreferenced* loose object in the object
> +store is something we'd like to avoid, at the cost of reduced
> +performance when using `core.fsyncMethod=batch`.
> ++
> +Currently this uses the same mechanism described in the "QUARANTINE
> +ENVIRONMENT" in the linkgit:git-receive-pack[1] documentation, but
> +that's subject to change. The performance loss is because we need to
> +"stage" the objects in that quarantine environment, fsync() it, and
> +once that's done rename() or link() it in-place into the main object
> +store, possibly with an fsync() of the index or ref at the end
> ++
> +With `fsyncMethod.batch.quarantine=false` we'll "stage" things in the
> +main object store, and then do one fsync() at the very end, either on
> +the last object we write, or file (index or ref) that'll make it
> +"reachable".
> ++
> +The bad thing about setting this to `true` is lost performance, as
> +well as not being able to access the objects as they're written (which
> +e.g. consumers of linkgit:git-update-index[1]'s `--verbose` mode might
> +want to do).

I wasn't able to understand clearly from your performance numbers.
What did you measure as the additional cost from quarantine=true
versus quarantine=false? Just if you have the numbers handy...

> ++
> +The good thing is that you should be guaranteed not to get e.g. short
> +or otherwise corrupt loose objects if you pull your power cord, in
> +practice various git commands deal quite badly with discovering such a
> +stray corrupt object (including perhaps assuming it's valid based on
> +its existence, or hard dying on an error rather than replacing
> +it). Repairing such "unreachable corruption" can require manual
> +intervention.
> +
>  core.fsyncObjectFiles::
> -       This boolean will enable 'fsync()' when writing object files.
> -       This setting is deprecated. Use core.fsync instead.
> -+
> -This setting affects data added to the Git repository in loose-object
> -form. When set to true, Git will issue an fsync or similar system call
> -to flush caches so that loose-objects remain consistent in the face
> -of a unclean system shutdown.
> +       This boolean will enable 'fsync()' when writing loose object
> +       files.
> ++
> +This setting is the historical fsync configuration setting. It's now
> +*deprecated*, you should use `core.fsync` instead, perhaps in
> +combination with `core.fsyncMethod=batch`.
> ++
> +The `core.fsyncObjectFiles` was initially added based on integrity
> +assumptions that early (pre-ext-4) versions of Linux's "ext"
> +filesystems provided.
> ++
> +I.e. that a write of file A without an `fsync()` followed by a write
> +of file `B` with `fsync()` would implicitly guarantee that `A' would
> +be `fsync()`'d by calling `fsync()` on `B`. This asssumption is *not*
> +backed up by any standard (e.g. POSIX), but worked in practice on some
> +Linux setups.
> ++
> +Nowadays you should almost certainly want to use
> +`core.fsync=loose-object` instead in combination with
> +`core.fsyncMethod=bulk`, and possibly with
> +`fsyncMethod.batch.quarantine=true`, see above. On modern OS's (Linux,
> +OSX, Windows) that gives you most of the performance benefit of
> +`core.fsyncObjectFiles=false` with all of the safety of the old
> +`core.fsyncObjectFiles=true`.
>
>  core.preloadIndex::
>         Enable parallel index preload for operations like 'git diff'
> --
> 2.35.1.1428.g1c1a0152d61
>

I think the notion of minimizing fsyncs across the whole repository is
a great one.  However, your implementation isn't clean from an API
perspective, since people modifying the top-level commands need to
reason about the full set of operations to avoid silently breaking the
fsync requirements.  I think we should phrase this as a "transaction"
that the top level command can begin and end. Subcomponents of the
repo can "enlist" in the transaction and do the right thing optimally
when the overall transaction commits or aborts.

In the end, I think the optimal solution should be layered on top of
the final form of my current patch series as an incremental
improvement.  I'm going to start the rebranding of
plug/unplug_bulk_checkin in V3 of the patch series.

Thanks,
Neeraj
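The "repo transaction" shape sketched in that reply (begin/end at the top level, subcomponents enlist, one flush when the transaction commits) could look roughly like the following. This is a hypothetical API sketch, not anything that exists in git; all names are invented:

```python
class RepoTransaction:
    """Hypothetical top-level transaction: components that deferred their
    own fsync enlist a callback; commit() runs them under one real flush."""

    def __init__(self):
        self.enlisted = []   # deferred durability work (migrate objdir, rename index, ...)
        self.flushes = 0     # how many hardware flushes we actually issued

    def enlist(self, flush_cb):
        self.enlisted.append(flush_cb)

    def commit(self):
        for cb in self.enlisted:
            cb()             # e.g. tmp_objdir_migrate(), rename index into place
        self.flushes += 1    # the single "real" flush for the whole batch
        self.enlisted.clear()

def demo():
    txn = RepoTransaction()
    done = []
    txn.enlist(lambda: done.append("loose-objects"))
    txn.enlist(lambda: done.append("index"))
    txn.commit()
    return done, txn.flushes
```

The point of the shape is composability: a top-level command doesn't need to know which subcomponents deferred their fsync, only that commit() settles all of them at once.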


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> +  updates in the disk writeback cache and then does a single full fsync of
> +  a dummy file to trigger the disk cache flush at the end of the operation.

It is unfortunate that we have a rather independent "unplug" that is
not tied to the "this is the last operation in the batch"---if there
were we didn't have to invent a dummy but a single full sync on the
real file who happened to be the last one in the batch would be
sufficient.  It would not matter, if the batch is any meaningful
size, hopefully.

> +/*
> + * Cleanup after batch-mode fsync_object_files.
> + */
> +static void do_batch_fsync(void)
> +{
> +	/*
> +	 * Issue a full hardware flush against a temporary file to ensure
> +	 * that all objects are durable before any renames occur.  The code in
> +	 * fsync_loose_object_bulk_checkin has already issued a writeout
> +	 * request, but it has not flushed any writeback cache in the storage
> +	 * hardware.
> +	 */
> +
> +	if (needs_batch_fsync) {
> +		struct strbuf temp_path = STRBUF_INIT;
> +		struct tempfile *temp;
> +
> +		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> +		temp = xmks_tempfile(temp_path.buf);
> +		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> +		delete_tempfile(&temp);
> +		strbuf_release(&temp_path);
> +		needs_batch_fsync = 0;
> +	}
> +
> +	if (bulk_fsync_objdir) {
> +		tmp_objdir_migrate(bulk_fsync_objdir);
> +		bulk_fsync_objdir = NULL;

The struct obtained from tmp_objdir_create() is consumed by
tmp_objdir_migrate() so the only clean-up left for the caller to do
is to clear it to NULL.  OK.

> +	}

This initially made me wonder why we need two independent flags.
After applying this patch but not any later steps, upon plugging, we
create the tentative object directory, and any loose object will be
created there, but because nobody calls the writeout-only variant
via fsync_loose_object_bulk_checkin() yet, needs_batch_fsync may not
be turned on.  But even in that case, any new loose objects are in
the tentative object directory and need to be migrated to the real
place.

And we may not cover all the existing code paths at the end of the
series, or any new code paths right away after they get introduced,
to be aware of the fsync_loose_object_bulk_checkin() when they
create a loose object file, so it is most likely that these two if
statements will be with us forever.

OK.

> @@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
>  	return 0;
>  }
>  
> +void fsync_loose_object_bulk_checkin(int fd)
> +{
> +	/*
> +	 * If we have a plugged bulk checkin, we issue a call that
> +	 * cleans the filesystem page cache but avoids a hardware flush
> +	 * command. Later on we will issue a single hardware flush
> +	 * before as part of do_batch_fsync.
> +	 */
> +	if (bulk_checkin_plugged &&
> +	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> +		assert(bulk_fsync_objdir);
> +		if (!needs_batch_fsync)
> +			needs_batch_fsync = 1;

Except for when we unplug, do we ever flip needs_batch_fsync bit
off, once it is set?  If the answer is no, wouldn't it be clearer to
unconditionally set it, instead of "set it only for the first time"?

> +	} else {
> +		fsync_or_die(fd, "loose object file");
> +	}
> +}
> +
>  int index_bulk_checkin(struct object_id *oid,
>  		       int fd, size_t size, enum object_type type,
>  		       const char *path, unsigned flags)
> @@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid,
>  void plug_bulk_checkin(void)
>  {
>  	assert(!bulk_checkin_plugged);
> +
> +	/*
> +	 * A temporary object directory is used to hold the files
> +	 * while they are not fsynced.
> +	 */
> +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> +		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> +		if (!bulk_fsync_objdir)
> +			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
> +
> +		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> +	}
> +
>  	bulk_checkin_plugged = 1;
>  }
>  
> @@ -297,4 +365,6 @@ void unplug_bulk_checkin(void)
>  	bulk_checkin_plugged = 0;
>  	if (bulk_checkin_state.f)
>  		finish_bulk_checkin(&bulk_checkin_state);
> +
> +	do_batch_fsync();
>  }
> diff --git a/bulk-checkin.h b/bulk-checkin.h
> index b26f3dc3b74..08f292379b6 100644
> --- a/bulk-checkin.h
> +++ b/bulk-checkin.h
> @@ -6,6 +6,8 @@
>  
>  #include "cache.h"
>  
> +void fsync_loose_object_bulk_checkin(int fd);
> +
>  int index_bulk_checkin(struct object_id *oid,
>  		       int fd, size_t size, enum object_type type,
>  		       const char *path, unsigned flags);
> diff --git a/cache.h b/cache.h
> index 3160bc1e489..d1ae51388c9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1040,7 +1040,8 @@ extern int use_fsync;
>  
>  enum fsync_method {
>  	FSYNC_METHOD_FSYNC,
> -	FSYNC_METHOD_WRITEOUT_ONLY
> +	FSYNC_METHOD_WRITEOUT_ONLY,
> +	FSYNC_METHOD_BATCH
>  };

Style.

These days we allow trailing comma to enum definitions.  Perhaps
give a trailing comma after _BATCH so that the next update patch
will become less noisy?

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 10:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> > +  updates in the disk writeback cache and then does a single full fsync of
> > +  a dummy file to trigger the disk cache flush at the end of the operation.
>
> It is unfortunate that we have a rather independent "unplug" that is
> not tied to the "this is the last operation in the batch"---if there
> were we didn't have to invent a dummy but a single full sync on the
> real file who happened to be the last one in the batch would be
> sufficient.  It would not matter, if the batch is any meaningful
> size, hopefully.
>

I'm banking on a large batch size, or on the additional cost of
creating and syncing an empty file being so small that it wouldn't be
noticeable even for
small batches. The current unfortunate scheme at least has a very simple API
that's easy to apply to any other operation going forward. For instance
builtin/hash-object.c might be another good operation, but it wasn't clear to me
if it's used for any mainline scenario.

> > +/*
> > + * Cleanup after batch-mode fsync_object_files.
> > + */
> > +static void do_batch_fsync(void)
> > +{
> > +     /*
> > +      * Issue a full hardware flush against a temporary file to ensure
> > +      * that all objects are durable before any renames occur.  The code in
> > +      * fsync_loose_object_bulk_checkin has already issued a writeout
> > +      * request, but it has not flushed any writeback cache in the storage
> > +      * hardware.
> > +      */
> > +
> > +     if (needs_batch_fsync) {
> > +             struct strbuf temp_path = STRBUF_INIT;
> > +             struct tempfile *temp;
> > +
> > +             strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> > +             temp = xmks_tempfile(temp_path.buf);
> > +             fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> > +             delete_tempfile(&temp);
> > +             strbuf_release(&temp_path);
> > +             needs_batch_fsync = 0;
> > +     }
> > +
> > +     if (bulk_fsync_objdir) {
> > +             tmp_objdir_migrate(bulk_fsync_objdir);
> > +             bulk_fsync_objdir = NULL;
>
> The struct obtained from tmp_objdir_create() is consumed by
> tmp_objdir_migrate() so the only clean-up left for the caller to do
> is to clear it to NULL.  OK.
>
> > +     }
>
> This initially made me wonder why we need two independent flags.
> After applying this patch but not any later steps, upon plugging, we
> create the tentative object directory, and any loose object will be
> created there, but because nobody calls the writeout-only variant
> via fsync_loose_object_bulk_checkin() yet, needs_batch_fsync may not
> be turned on.  But even in that case, any new loose objects are in
> the tentative object directory and need to be migrated to the real
> place.
>
> And we may not cover all the existing code paths at the end of the
> series, or any new code paths right away after they get introduced,
> to be aware of the fsync_loose_object_bulk_checkin() when they
> create a loose object file, so it is most likely that these two if
> statements will be with us forever.
>
> OK.

After Avarb's last feedback, I've changed this to lazily create the objdir, so
the existence of an objdir is a suitable proxy for there being something worth
syncing. The potential downside is that the lazy-creation would need to be
synchronized if the ODB becomes multithreaded.

>
> > @@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
> >       return 0;
> >  }
> >
> > +void fsync_loose_object_bulk_checkin(int fd)
> > +{
> > +     /*
> > +      * If we have a plugged bulk checkin, we issue a call that
> > +      * cleans the filesystem page cache but avoids a hardware flush
> > +      * command. Later on we will issue a single hardware flush
> > +      * before as part of do_batch_fsync.
> > +      */
> > +     if (bulk_checkin_plugged &&
> > +         git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> > +             assert(bulk_fsync_objdir);
> > +             if (!needs_batch_fsync)
> > +                     needs_batch_fsync = 1;
>
> Except for when we unplug, do we ever flip needs_batch_fsync bit
> off, once it is set?  If the answer is no, wouldn't it be clearer to
> unconditionally set it, instead of "set it only for the first time"?
>

This code is now gone. I was stupidly optimizing for a future
multithreaded world
which might never come.

> > +     } else {
> > +             fsync_or_die(fd, "loose object file");
> > +     }
> > +}
> > +
> >  int index_bulk_checkin(struct object_id *oid,
> >                      int fd, size_t size, enum object_type type,
> >                      const char *path, unsigned flags)
> > @@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid,
> >  void plug_bulk_checkin(void)
> >  {
> >       assert(!bulk_checkin_plugged);
> > +
> > +     /*
> > +      * A temporary object directory is used to hold the files
> > +      * while they are not fsynced.
> > +      */
> > +     if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> > +             bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> > +             if (!bulk_fsync_objdir)
> > +                     die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
> > +
> > +             tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> > +     }
> > +
> >       bulk_checkin_plugged = 1;
> >  }
> >
> > @@ -297,4 +365,6 @@ void unplug_bulk_checkin(void)
> >       bulk_checkin_plugged = 0;
> >       if (bulk_checkin_state.f)
> >               finish_bulk_checkin(&bulk_checkin_state);
> > +
> > +     do_batch_fsync();
> >  }
> > diff --git a/bulk-checkin.h b/bulk-checkin.h
> > index b26f3dc3b74..08f292379b6 100644
> > --- a/bulk-checkin.h
> > +++ b/bulk-checkin.h
> > @@ -6,6 +6,8 @@
> >
> >  #include "cache.h"
> >
> > +void fsync_loose_object_bulk_checkin(int fd);
> > +
> >  int index_bulk_checkin(struct object_id *oid,
> >                      int fd, size_t size, enum object_type type,
> >                      const char *path, unsigned flags);
> > diff --git a/cache.h b/cache.h
> > index 3160bc1e489..d1ae51388c9 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1040,7 +1040,8 @@ extern int use_fsync;
> >
> >  enum fsync_method {
> >       FSYNC_METHOD_FSYNC,
> > -     FSYNC_METHOD_WRITEOUT_ONLY
> > +     FSYNC_METHOD_WRITEOUT_ONLY,
> > +     FSYNC_METHOD_BATCH
> >  };
>
> Style.
>
> These days we allow trailing comma to enum definitions.  Perhaps
> give a trailing comma after _BATCH so that the next update patch
> will become less noisy?
>

Fixed.

> Thanks.

Thanks!
-Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index 9da3e5d88f6..3c90ba0b395 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -596,6 +596,14 @@ core.fsyncMethod::
>  * `writeout-only` issues pagecache writeback requests, but depending on the
>    filesystem and storage hardware, data added to the repository may not be
>    durable in the event of a system crash. This is the default mode on macOS.
> +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> +  updates in the disk writeback cache and then does a single full fsync of
> +  a dummy file to trigger the disk cache flush at the end of the operation.
> ++
> +  Currently `batch` mode only applies to loose-object files. Other repository
> +  data is made durable as if `fsync` was specified. This mode is expected to
> +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
> +  and on Windows for repos stored on NTFS or ReFS filesystems.

Does this format correctly?  I had an impression that the second and
subsequent paragraphs, connected with a line with a single "+" on
it, have to be flushed left without indentation.

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 8b0fd5c7723..9799d247cad 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -3,15 +3,20 @@
>   */
>  #include "cache.h"
>  #include "bulk-checkin.h"
> +#include "lockfile.h"
>  #include "repository.h"
>  #include "csum-file.h"
>  #include "pack.h"
>  #include "strbuf.h"
> +#include "string-list.h"
> +#include "tmp-objdir.h"
>  #include "packfile.h"
>  #include "object-store.h"
>  
>  static int odb_transaction_nesting;
>  
> +static struct tmp_objdir *bulk_fsync_objdir;

I wonder if this should be added to the bulk_checkin_state structure
as a new member, especially if we fix the erroneous call to
finish_bulk_checkin() as a preliminary fix-up of a bug that existed
even before this series.

> +/*
> + * Cleanup after batch-mode fsync_object_files.
> + */
> +static void do_batch_fsync(void)
> +{
> +	struct strbuf temp_path = STRBUF_INIT;
> +	struct tempfile *temp;
> +
> +	if (!bulk_fsync_objdir)
> +		return;
> +
> +	/*
> +	 * Issue a full hardware flush against a temporary file to ensure
> +	 * that all objects are durable before any renames occur. The code in
> +	 * fsync_loose_object_bulk_checkin has already issued a writeout
> +	 * request, but it has not flushed any writeback cache in the storage
> +	 * hardware or any filesystem logs. This fsync call acts as a barrier
> +	 * to ensure that the data in each new object file is durable before
> +	 * the final name is visible.
> +	 */
> +	strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> +	temp = xmks_tempfile(temp_path.buf);
> +	fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> +	delete_tempfile(&temp);
> +	strbuf_release(&temp_path);
> +
> +	/*
> +	 * Make the object files visible in the primary ODB after their data is
> +	 * fully durable.
> +	 */
> +	tmp_objdir_migrate(bulk_fsync_objdir);
> +	bulk_fsync_objdir = NULL;
> +}

OK.

> +void prepare_loose_object_bulk_checkin(void)
> +{
> +	/*
> +	 * We lazily create the temporary object directory
> +	 * the first time an object might be added, since
> +	 * callers may not know whether any objects will be
> +	 * added at the time they call begin_odb_transaction.
> +	 */
> +	if (!odb_transaction_nesting || bulk_fsync_objdir)
> +		return;
> +
> +	bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> +	if (bulk_fsync_objdir)
> +		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> +}

OK.  If we got a failure from tmp_objdir_create(), then we don't
swap and end up creating a new loose object file in the primary
object store.  I wonder if we at least want to note that fact for
later use at "unplug" time.  We may create a few loose objects in
the primary object store without any fsync, then a later call may
successfully create a temporary object directory and we'd create
more loose objects in the temporary one, which are flushed with the
"create a dummy and fsync" trick and migrated, but do we need to do
something to the ones we created in the primary object store before
all that happens?

> +void fsync_loose_object_bulk_checkin(int fd, const char *filename)
> +{
> +	/*
> +	 * If we have an active ODB transaction, we issue a call that
> +	 * cleans the filesystem page cache but avoids a hardware flush
> +	 * command. Later on we will issue a single hardware flush
> +	 * as part of do_batch_fsync.
> +	 */
> +	if (!bulk_fsync_objdir ||
> +	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0) {
> +		fsync_or_die(fd, filename);
> +	}
> +}

Ah, if we have successfully created the temporary directory, we
don't do full fsync but just writeout-only one, so there is no need
for the worry I mentioned in the previous paragraph.  OK.

> @@ -301,4 +370,6 @@ void end_odb_transaction(void)
>  
>  	if (bulk_checkin_state.f)
>  		finish_bulk_checkin(&bulk_checkin_state);
> +
> +	do_batch_fsync();
>  }

OK.

> @@ -1961,6 +1963,9 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
>  	static struct strbuf tmp_file = STRBUF_INIT;
>  	static struct strbuf filename = STRBUF_INIT;
>  
> +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> +		prepare_loose_object_bulk_checkin();
> +
>  	loose_object_path(the_repository, &filename, oid);
>  
>  	fd = create_tmpfile(&tmp_file, filename.buf);

The necessary change to the "workhorse" code path is surprisingly
small, which is pleasing to see.

Thanks.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 30, 2022 at 10:37 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> > index 9da3e5d88f6..3c90ba0b395 100644
> > --- a/Documentation/config/core.txt
> > +++ b/Documentation/config/core.txt
> > @@ -596,6 +596,14 @@ core.fsyncMethod::
> >  * `writeout-only` issues pagecache writeback requests, but depending on the
> >    filesystem and storage hardware, data added to the repository may not be
> >    durable in the event of a system crash. This is the default mode on macOS.
> > +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> > +  updates in the disk writeback cache and then does a single full fsync of
> > +  a dummy file to trigger the disk cache flush at the end of the operation.
> > ++
> > +  Currently `batch` mode only applies to loose-object files. Other repository
> > +  data is made durable as if `fsync` was specified. This mode is expected to
> > +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
> > +  and on Windows for repos stored on NTFS or ReFS filesystems.
>
> Does this format correctly?  I had an impression that the second and
> subsequent paragraphs, connected with a line with a single "+" on
> it, have to be flushed left without indentation.
>

Here's the man format (Not sure how this will render going through gmail):
           •   batch enables a mode that uses writeout-only flushes to
stage multiple updates in the disk writeback cache and then does a
single full fsync of a dummy file to trigger the disk cache flush at
the end of the
               operation.

                   Currently `batch` mode only applies to loose-object
files. Other repository
                   data is made durable as if `fsync` was specified.
This mode is expected to
                   be as safe as `fsync` on macOS for repos stored on
HFS+ or APFS filesystems
                   and on Windows for repos stored on NTFS or ReFS filesystems.

To describe the above if it doesn't render correctly, we have a
bulleted list where the batch after the * is bolded.  Other instances
of single backtick quoted text just appears as plaintext. The
descriptive "Currently `batch` mode..." paragraph is under the bullet
point and well-indented.

In HTML the output looks good as well, except that the descriptive
paragraph is in monospace for some reason.

> > diff --git a/bulk-checkin.c b/bulk-checkin.c
> > index 8b0fd5c7723..9799d247cad 100644
> > --- a/bulk-checkin.c
> > +++ b/bulk-checkin.c
> > @@ -3,15 +3,20 @@
> >   */
> >  #include "cache.h"
> >  #include "bulk-checkin.h"
> > +#include "lockfile.h"
> >  #include "repository.h"
> >  #include "csum-file.h"
> >  #include "pack.h"
> >  #include "strbuf.h"
> > +#include "string-list.h"
> > +#include "tmp-objdir.h"
> >  #include "packfile.h"
> >  #include "object-store.h"
> >
> >  static int odb_transaction_nesting;
> >
> > +static struct tmp_objdir *bulk_fsync_objdir;
>
> I wonder if this should be added to the bulk_checkin_state structure
> as a new member, especially if we fix the erroneous call to
> finish_bulk_checkin() as a preliminary fix-up of a bug that existed
> even before this series.
>

It seems like the only thing tying this to the bulk_checkin_state
(which I've renamed in my local changes to bulk_checkin_packfile) is
that they're both generally written when a transaction is active.
Keeping fsync separate from packfile should help the reader see that
the two sets of functions access only their respective global state.
If we add another optimization strategy (e.g. appendable pack files),
it would get its own separate state and functions that are independent
of the large-blob packfile and loose-object fsync optimizations.

> > +/*
> > + * Cleanup after batch-mode fsync_object_files.
> > + */
> > +static void do_batch_fsync(void)
> > +{
> > +     struct strbuf temp_path = STRBUF_INIT;
> > +     struct tempfile *temp;
> > +
> > +     if (!bulk_fsync_objdir)
> > +             return;
> > +
> > +     /*
> > +      * Issue a full hardware flush against a temporary file to ensure
> > +      * that all objects are durable before any renames occur. The code in
> > +      * fsync_loose_object_bulk_checkin has already issued a writeout
> > +      * request, but it has not flushed any writeback cache in the storage
> > +      * hardware or any filesystem logs. This fsync call acts as a barrier
> > +      * to ensure that the data in each new object file is durable before
> > +      * the final name is visible.
> > +      */
> > +     strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> > +     temp = xmks_tempfile(temp_path.buf);
> > +     fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> > +     delete_tempfile(&temp);
> > +     strbuf_release(&temp_path);
> > +
> > +     /*
> > +      * Make the object files visible in the primary ODB after their data is
> > +      * fully durable.
> > +      */
> > +     tmp_objdir_migrate(bulk_fsync_objdir);
> > +     bulk_fsync_objdir = NULL;
> > +}
>
> OK.
>
> > +void prepare_loose_object_bulk_checkin(void)
> > +{
> > +     /*
> > +      * We lazily create the temporary object directory
> > +      * the first time an object might be added, since
> > +      * callers may not know whether any objects will be
> > +      * added at the time they call begin_odb_transaction.
> > +      */
> > +     if (!odb_transaction_nesting || bulk_fsync_objdir)
> > +             return;
> > +
> > +     bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> > +     if (bulk_fsync_objdir)
> > +             tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> > +}
>
> OK.  If we got a failure from tmp_objdir_create(), then we don't
> swap and end up creating a new loose object file in the primary
> object store.  I wonder if we at least want to note that fact for
> later use at "unplug" time.  We may create a few loose objects in
> the primary object store without any fsync, then a later call may
> successfully create a temporary object directory and we'd create
> more loose objects in the temporary one, which are flushed with the
> "create a dummy and fsync" trick and migrated, but do we need to do
> something to the ones we created in the primary object store before
> all that happens?
>
> > +void fsync_loose_object_bulk_checkin(int fd, const char *filename)
> > +{
> > +     /*
> > +      * If we have an active ODB transaction, we issue a call that
> > +      * cleans the filesystem page cache but avoids a hardware flush
> > +      * command. Later on we will issue a single hardware flush
> > +      * as part of do_batch_fsync.
> > +      */
> > +     if (!bulk_fsync_objdir ||
> > +         git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0) {
> > +             fsync_or_die(fd, filename);
> > +     }
> > +}
>
> Ah, if we have successfully created the temporary directory, we
> don't do full fsync but just writeout-only one, so there is no need
> for the worry I mentioned in the previous paragraph.  OK.
>

There is the possibility that we might create the objdir when calling
prepare_loose_object_bulk_checkin but somehow fail to write the object
and yet still make it to end_odb_transaction.  In that case, we'd
issue an extra dummy fsync.  I don't think this case is worth extra
code to track, since it's a single fsync in a weird failure case.

> > @@ -301,4 +370,6 @@ void end_odb_transaction(void)
> >
> >       if (bulk_checkin_state.f)
> >               finish_bulk_checkin(&bulk_checkin_state);
> > +
> > +     do_batch_fsync();
> >  }
>
> OK.
>
> > @@ -1961,6 +1963,9 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
> >       static struct strbuf tmp_file = STRBUF_INIT;
> >       static struct strbuf filename = STRBUF_INIT;
> >
> > +     if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> > +             prepare_loose_object_bulk_checkin();
> > +
> >       loose_object_path(the_repository, &filename, oid);
> >
> >       fd = create_tmpfile(&tmp_file, filename.buf);
>
> The necessary change to the "workhorse" code path is surprisingly
> small, which is pleasing to see.
>
> Thanks.

Thanks for looking at this.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

> To describe the above if it doesn't render correctly, we have a
> bulleted list where the batch after the * is bolded.  Other instances
> of single backtick quoted text just appears as plaintext. The
> descriptive "Currently `batch` mode..." paragraph is under the bullet
> point and well-indented.
>
> In HTML the output looks good as well, except that the descriptive
> paragraph is in monospace for some reason.

The "except" part admits that it does not render well, no?

What happens if you modify the second and subsequent paragraph after
the "+" continuation in the way suggested?  Does it make it better,
or does it make it worse?

> Keeping fsync separate from packfile should help the reader see that
> the two sets of functions access only their respective global state.
> If we add another optimization strategy (e.g. appendable pack files),
> it would get its own separate state and functions that are independent
> of the large-blob packfile and loose-object fsync optimizations.

OK.

>> > +void fsync_loose_object_bulk_checkin(int fd, const char *filename)
>> > +{
>> > +     /*
>> > +      * If we have an active ODB transaction, we issue a call that
>> > +      * cleans the filesystem page cache but avoids a hardware flush
>> > +      * command. Later on we will issue a single hardware flush
>> > +      * as part of do_batch_fsync.
>> > +      */
>> > +     if (!bulk_fsync_objdir ||
>> > +         git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0) {
>> > +             fsync_or_die(fd, filename);
>> > +     }
>> > +}
>>
>> Ah, if we have successfully created the temporary directory, we
>> don't do full fsync but just writeout-only one, so there is no need
>> for the worry I mentioned in the previous paragraph.  OK.
>
> There is the possibility that we might create the objdir when calling
> prepare_loose_object_bulk_checkin but somehow fail to write the object
> and yet still make it to end_odb_transaction.  In that case, we'd
> issue an extra dummy fsync.  I don't think this case is worth extra
> code to track, since it's a single fsync in a weird failure case.

Yup.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Thu, Mar 31, 2022 at 11:05 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> > To describe the above if it doesn't render correctly, we have a
> > bulleted list where the batch after the * is bolded.  Other instances
> > of single backtick quoted text just appears as plaintext. The
> > descriptive "Currently `batch` mode..." paragraph is under the bullet
> > point and well-indented.
> >
> > In HTML the output looks good as well, except that the descriptive
> > paragraph is in monospace for some reason.
>
> The "except" part admits that it does not render well, no?
>
> What happens if you modify the second and subsequent paragraph after
> the "+" continuation in the way suggested?  Does it make it better,
> or does it make it worse?
>

Apologies, I misinterpreted your statement that the input "has to be
flushed left without indentation".  Now that I flushed it left I'm
getting better output where the follow-on paragraph has a "normal"
text style and interior backtick quoted things are bolded as expected.

This will be fixed in the next iteration.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

> ...  Now that I flushed it left I'm
> getting better output where the follow-on paragraph has a "normal"
> text style and interior backtick quoted things are bolded as expected.

I was reasonably but not absolutely sure if that works inside
bulletted list.  Thanks for experimenting it for us ;-)


filesystem and storage hardware, data added to the repository may not be
durable in the event of a system crash. This is the default mode on macOS.
* `batch` enables a mode that uses writeout-only flushes to stage multiple
updates in the disk writeback cache and then does a single full fsync of
a dummy file to trigger the disk cache flush at the end of the operation.
+
Currently `batch` mode only applies to loose-object files. Other repository
data is made durable as if `fsync` was specified. This mode is expected to
be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
and on Windows for repos stored on NTFS or ReFS filesystems.

core.fsyncObjectFiles::
This boolean will enable 'fsync()' when writing object files.
13 changes: 11 additions & 2 deletions builtin/add.c
@@ -141,7 +141,16 @@ int add_files_to_cache(const char *prefix,
rev.diffopt.format_callback_data = &data;
On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> The add_files_to_cache function is invoked internally by
> builtin/commit.c and builtin/checkout.c for their flags that stage
> modified files before doing the larger operation. These commands
> can benefit from batched fsyncing.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/add.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index 9bf37ceae8e..e39770e4746 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -141,7 +141,16 @@ int add_files_to_cache(const char *prefix,
>  	rev.diffopt.format_callback_data = &data;
>  	rev.diffopt.flags.override_submodule_config = 1;
>  	rev.max_count = 0; /* do not compare unmerged paths with stage #2 */
> +
> +	/*
> +	 * Use an ODB transaction to optimize adding multiple objects.
> +	 * This function is invoked from commands other than 'add', which
> +	 * may not have their own transaction active.
> +	 */
> +	begin_odb_transaction();
>  	run_diff_files(&rev, DIFF_RACY_IS_MODIFIED);
> +	end_odb_transaction();
> +

This one clearly is "bulk".  Makes sense.

rev.diffopt.flags.override_submodule_config = 1;
rev.max_count = 0; /* do not compare unmerged paths with stage #2 */

/*
* Use an ODB transaction to optimize adding multiple objects.
* This function is invoked from commands other than 'add', which
* may not have their own transaction active.
*/
begin_odb_transaction();
run_diff_files(&rev, DIFF_RACY_IS_MODIFIED);
end_odb_transaction();

clear_pathspec(&rev.prune_data);
return !!data.add_errors;
}
@@ -670,7 +679,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
string_list_clear(&only_match_skip_worktree, 0);
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Make it clearer in the naming and documentation of the plug_bulk_checkin
> and unplug_bulk_checkin APIs that they can be thought of as
> a "transaction" to optimize operations on the object database.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/add.c  |  4 ++--
>  bulk-checkin.c |  4 ++--
>  bulk-checkin.h | 14 ++++++++++++--
>  3 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index 3ffb86a4338..9bf37ceae8e 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  		string_list_clear(&only_match_skip_worktree, 0);
>  	}
>  
> -	plug_bulk_checkin();
> +	begin_odb_transaction();
>  
>  	if (add_renormalize)
>  		exit_status |= renormalize_tracked_files(&pathspec, flags);
> @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  
>  	if (chmod_arg && pathspec.nr)
>  		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> -	unplug_bulk_checkin();
> +	end_odb_transaction();

Aside from anything else we've (dis)agreed on, I found this part really
odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
the plug_bulk_checkin() was something that originated with this series
which adds the "bulk" mode.

But no, on second inspection it's a thing Junio added a long time ago so
that in this case we "stream to N pack" where we'd otherwise add N loose
objects.

Which, and I think Junio brought this up in an earlier round, but I
didn't fully understand that at the time makes this whole thing quite
odd to me.

So first, shouldn't we add this begin_odb_transaction() as a new thing?
I.e. surely wanting to do that object target redirection within a given
begin/end "scope" should be orthogonal to how fsync() happens within
that "scope", though in this case that happens to correspond.

And secondly, per the commit message and comment when it was added in
(568508e7657 (bulk-checkin: replace fast-import based implementation,
2011-10-28)) is it something we need *for that purpose* with the series
to unpack-objects without malloc()ing the size of the blob[1].

And, if so and orthogonal to that: If we know how to either stream N
objects to a PACK (as fast-import does), *and* we now (or SOON) know how
to stream loose objects without using size(blob) amounts of memory,
doesn't the "optimize fsync()" rather want to make use of the
stream-to-pack approach?

I.e. have you tried for the cases where we create say 1k objects for
"git stash" tried to stream those to a pack? How does that compare (both
with/without the fsync changes).

I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
(and I think can use it in either case, to defer object syncs until the
"index" or "ref" sync, as my RFC does) I worry that we're adding a bunch
of configuration and complexity for something that:

 1. Ultimately isn't all that important, as already for part of it we
    can mostly configure it away. I.e. "git-unpack-objects" v.s. writing
    a pack, cf. transfer.unpackLimit)
 2. We don't have #1 for "add" and "update-index", but if we stream to
    packs there is there any remaining benefit in practice?

1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@gmail.com/
2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Thu, Mar 24, 2022 at 9:24 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Make it clearer in the naming and documentation of the plug_bulk_checkin
> > and unplug_bulk_checkin APIs that they can be thought of as
> > a "transaction" to optimize operations on the object database.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  builtin/add.c  |  4 ++--
> >  bulk-checkin.c |  4 ++--
> >  bulk-checkin.h | 14 ++++++++++++--
> >  3 files changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/builtin/add.c b/builtin/add.c
> > index 3ffb86a4338..9bf37ceae8e 100644
> > --- a/builtin/add.c
> > +++ b/builtin/add.c
> > @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >               string_list_clear(&only_match_skip_worktree, 0);
> >       }
> >
> > -     plug_bulk_checkin();
> > +     begin_odb_transaction();
> >
> >       if (add_renormalize)
> >               exit_status |= renormalize_tracked_files(&pathspec, flags);
> > @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >
> >       if (chmod_arg && pathspec.nr)
> >               exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> > -     unplug_bulk_checkin();
> > +     end_odb_transaction();
>
> Aside from anything else we've (dis)agreed on, I found this part really
> odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
> the plug_bulk_checkin() was something that originated with this series
> which adds the "bulk" mode.
>
> But no, on second inspection it's a thing Junio added a long time ago so
> that in this case we "stream to N pack" where we'd otherwise add N loose
> objects.
>
> Which, and I think Junio brought this up in an earlier round, but I
> didn't fully understand that at the time makes this whole thing quite
> odd to me.
>
> So first, shouldn't we add this begin_odb_transaction() as a new thing?
> I.e. surely wanting to do that object target redirection within a given
> begin/end "scope" should be orthogonal to how fsync() happens within
> that "scope", though in this case that happens to correspond.
>
> And secondly, per the commit message and comment when it was added in
> (568508e7657 (bulk-checkin: replace fast-import based implementation,
> 2011-10-28)) is it something we need *for that purpose* with the series
> to unpack-objects without malloc()ing the size of the blob[1].
>

The original change seems to be about optimizing addition of
successive large blobs to the ODB when we know we have a large batch.
It's a batch-mode optimization for the ODB, similar to my patch
series, just targeting large blobs rather than small blobs/trees.  It
also has the same property that the added data is "invisible" until
the transaction ends.

> And, if so and orthogonal to that: If we know how to either stream N
> objects to a PACK (as fast-import does), *and* we now (or SOON) know how
> to stream loose objects without using size(blob) amounts of memory,
> doesn't the "optimize fsync()" rather want to make use of the
> stream-to-pack approach?
>
> I.e. have you tried for the cases where we create say 1k objects for
> "git stash" tried to stream those to a pack? How does that compare (both
> with/without the fsync changes).
>
> I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
> (and I think can use it in either case, to defer object syncs until the
> "index" or "ref" sync, as my RFC does) I worry that we're adding a bunch
> of configuration and complexity for something that:
>
>  1. Ultimately isn't all that important, as already for part of it we
>     can mostly configure it away. I.e. "git-unpack-objects" v.s. writing
>     a pack, cf. transfer.unpackLimit)
>  2. We don't have #1 for "add" and "update-index", but if we stream to
>     packs there is there any remaining benefit in practice?
>
> 1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@gmail.com/
> 2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/

Stream to pack is a good idea.  But I think we'd want a way to append
to the most recent pack so that we don't explode the number of packs,
which seems to impose a linear cost on ODB operations, at least to
load up the indexes.  I think this is orthogonal and we can always
change the meaning of batch mode to use a pack mechanism when such a
mechanism is ready.

Thanks,
Neeraj

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Make it clearer in the naming and documentation of the plug_bulk_checkin
> and unplug_bulk_checkin APIs that they can be thought of as
> a "transaction" to optimize operations on the object database. These
> transactions may be nested so that subsystems like the cache-tree
> writing code can optimize their operations without caring whether the
> top-level code has a transaction active.

I can see that "checkin" part of the name is too limiting (you may
want to do more than optimize checkin, e.g. fsync), and that you may
prefer "begin/end" over "plug/unplug", but I am not sure if we want
to limit ourselves to "odb".  If we find our code doing things on
many instances of something that are not objects (e.g. refs, config
variables), don't we want to give them the same chance to be optimized
by batching them?

{begin,end}_bulk_transaction perhaps?  I dunno.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 30, 2022 at 10:17 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Make it clearer in the naming and documentation of the plug_bulk_checkin
> > and unplug_bulk_checkin APIs that they can be thought of as
> > a "transaction" to optimize operations on the object database. These
> > transactions may be nested so that subsystems like the cache-tree
> > writing code can optimize their operations without caring whether the
> > top-level code has a transaction active.
>
> I can see that "checkin" part of the name is too limiting (you may
> want to do more than optimize checkin, e.g. fsync), and that you may
> prefer "begin/end" over "plug/unplug", but I am not sure if we want
> to limit ourselves to "odb".  If we find our code doing things on
> many instances of something that are not objects (e.g. refs, config
> variables), don't we want to give them the same chance to be optimized
> by batching them?
>
> {begin,end}_bulk_transaction perhaps?  I dunno.

At least in the current code where the implementation of each
'database table' (odb, refs-db, config, index) is pretty separate, it
seems better to keep bulk-checkin.c and its plugging scoped to the
ODB.  Patrick's (who I previously misnamed as 'Peter') older patch at
https://lore.kernel.org/git/d9aa96913b1730f1d0c238d7d52e27c20bc55390.1636544377.git.ps@pks.im/
showed a pretty nice and concise implementation for refs tied to the
ref-transaction infrastructure.

Thanks,
Neeraj

}

plug_bulk_checkin();
begin_odb_transaction();

if (add_renormalize)
exit_status |= renormalize_tracked_files(&pathspec, flags);
@@ -682,7 +691,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)

if (chmod_arg && pathspec.nr)
exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
unplug_bulk_checkin();
end_odb_transaction();

finish:
if (write_locked_index(&the_index, &lock_file,
3 changes: 3 additions & 0 deletions builtin/unpack-objects.c
@@ -1,5 +1,6 @@
#include "builtin.h"
On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> The unpack-objects functionality is used by fetch, push, and fast-import
> to turn the transfered data into object database entries when there are
> fewer objects than the 'unpacklimit' setting.
>
> By enabling bulk-checkin when unpacking objects, we can take advantage
> of batched fsyncs.

This feels confused in that we dispatch to unpack-objects (instead
of index-objects) only when the number of loose objects should not
matter from performance point of view, and bulk-checkin should shine
from performance point of view only when there are enough objects to
batch.

Also if we ever add "too many small loose objects is wasteful, let's
send them into a single 'batch pack'" optimization, it would create
a funny situation where the caller sends the contents of a small
incoming packfile to unpack-objects, but the command chooses to
bunch them all together in a packfile anyway ;-)

So, I dunno.


> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/unpack-objects.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/builtin/unpack-objects.c b/builtin/unpack-objects.c
> index dbeb0680a58..c55b6616aed 100644
> --- a/builtin/unpack-objects.c
> +++ b/builtin/unpack-objects.c
> @@ -1,5 +1,6 @@
>  #include "builtin.h"
>  #include "cache.h"
> +#include "bulk-checkin.h"
>  #include "config.h"
>  #include "object-store.h"
>  #include "object.h"
> @@ -503,10 +504,12 @@ static void unpack_all(void)
>  	if (!quiet)
>  		progress = start_progress(_("Unpacking objects"), nr_objects);
>  	CALLOC_ARRAY(obj_list, nr_objects);
> +	plug_bulk_checkin();
>  	for (i = 0; i < nr_objects; i++) {
>  		unpack_one(i);
>  		display_progress(progress, i + 1);
>  	}
> +	unplug_bulk_checkin();
>  	stop_progress(&progress);
>  
>  	if (delta_list)


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 10:55 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > The unpack-objects functionality is used by fetch, push, and fast-import
> > to turn the transferred data into object database entries when there are
> > fewer objects than the 'unpacklimit' setting.
> >
> > By enabling bulk-checkin when unpacking objects, we can take advantage
> > of batched fsyncs.
>
> This feels confused in that we dispatch to unpack-objects (instead
> of index-objects) only when the number of loose objects should not
> matter from performance point of view, and bulk-checkin should shine
> from performance point of view only when there are enough objects to
> batch.
>
> Also if we ever add "too many small loose objects is wasteful, let's
> send them into a single 'batch pack'" optimization, it would create
> a funny situation where the caller sends the contents of a small
> incoming packfile to unpack-objects, but the command chooses to
> bunch them all together in a packfile anyway ;-)
>
> So, I dunno.
>

I'd be happy to just drop this patch.  I originally added it to answer Avarab's
question: how does batch mode compare to packfiles? [1] [2].

[1] https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com/
[2] https://lore.kernel.org/git/pull.1076.v5.git.git.1632514331.gitgitgadget@gmail.com/


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 4:02 PM Neeraj Singh <nksingh85@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 10:55 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> > > From: Neeraj Singh <neerajsi@microsoft.com>
> > >
> > > The unpack-objects functionality is used by fetch, push, and fast-import
> > > to turn the transferred data into object database entries when there are
> > > fewer objects than the 'unpacklimit' setting.
> > >
> > > By enabling bulk-checkin when unpacking objects, we can take advantage
> > > of batched fsyncs.
> >
> > This feels confused in that we dispatch to unpack-objects (instead
> > of index-objects) only when the number of loose objects should not
> > matter from performance point of view, and bulk-checkin should shine
> > from performance point of view only when there are enough objects to
> > batch.
> >
> > Also if we ever add "too many small loose objects is wasteful, let's
> > send them into a single 'batch pack'" optimization, it would create
> > a funny situation where the caller sends the contents of a small
> > incoming packfile to unpack-objects, but the command chooses to
> > bunch them all together in a packfile anyway ;-)
> >
> > So, I dunno.
> >
>
> I'd be happy to just drop this patch.  I originally added it to answer Avarab's
> question: how does batch mode compare to packfiles? [1] [2].
>
> [1] https://lore.kernel.org/git/87mtp5cwpn.fsf@evledraar.gmail.com/
> [2] https://lore.kernel.org/git/pull.1076.v5.git.git.1632514331.gitgitgadget@gmail.com/

Well looking back again at the spreadsheet [3], at 90 objects, which
is below the
default transfer.unpackLimit, we see a 3x difference in performance
between batch
mode and the default fsync mode.  That's a different interaction class
(230 ms versus 760 ms).

I'll include a small table in the commit description with these
performance numbers to
help justify it.

[3] https://docs.google.com/spreadsheets/d/1uxMBkEXFFnQ1Y3lXKqcKpw6Mq44BzhpCAcPex14T-QQ

#include "cache.h"
#include "bulk-checkin.h"
#include "config.h"
#include "object-store.h"
#include "object.h"
@@ -503,10 +504,12 @@ static void unpack_all(void)
if (!quiet)
progress = start_progress(_("Unpacking objects"), nr_objects);
CALLOC_ARRAY(obj_list, nr_objects);
begin_odb_transaction();
for (i = 0; i < nr_objects; i++) {
unpack_one(i);
display_progress(progress, i + 1);
}
end_odb_transaction();
stop_progress(&progress);

if (delta_list)
20 changes: 20 additions & 0 deletions builtin/update-index.c
@@ -5,6 +5,7 @@
*/

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> The update-index functionality is used internally by 'git stash push' to
> set up the internal stashed commit.
>
> This change enables bulk-checkin for update-index infrastructure to
> speed up adding new objects to the object database by leveraging the
> batch fsync functionality.
>
> There is some risk with this change, since under batch fsync, the object
> files will be in a tmp-objdir until update-index is complete.  This
> usage is unlikely, since any tool invoking update-index and expecting to
> see objects would have to synchronize with the update-index process
> after passing it a file path.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/update-index.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 75d646377cc..38e9d7e88cb 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -5,6 +5,7 @@
>   */
>  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>  #include "cache.h"
> +#include "bulk-checkin.h"
>  #include "config.h"
>  #include "lockfile.h"
>  #include "quote.h"
> @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  
>  	the_index.updated_skipworktree = 1;
>  
> +	/* we might be adding many objects to the object database */
> +	plug_bulk_checkin();
> +

Shouldn't this be after parse_options_start()?


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 8:04 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > The update-index functionality is used internally by 'git stash push' to
> > set up the internal stashed commit.
> >
> > This change enables bulk-checkin for update-index infrastructure to
> > speed up adding new objects to the object database by leveraging the
> > batch fsync functionality.
> >
> > There is some risk with this change, since under batch fsync, the object
> > files will be in a tmp-objdir until update-index is complete.  This
> > usage is unlikely, since any tool invoking update-index and expecting to
> > see objects would have to synchronize with the update-index process
> > after passing it a file path.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  builtin/update-index.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/builtin/update-index.c b/builtin/update-index.c
> > index 75d646377cc..38e9d7e88cb 100644
> > --- a/builtin/update-index.c
> > +++ b/builtin/update-index.c
> > @@ -5,6 +5,7 @@
> >   */
> >  #define USE_THE_INDEX_COMPATIBILITY_MACROS
> >  #include "cache.h"
> > +#include "bulk-checkin.h"
> >  #include "config.h"
> >  #include "lockfile.h"
> >  #include "quote.h"
> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >
> >       the_index.updated_skipworktree = 1;
> >
> > +     /* we might be adding many objects to the object database */
> > +     plug_bulk_checkin();
> > +
>
> Shouldn't this be after parse_options_start()?

Does it make a difference?  Especially if we do the object dir creation lazily?


On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 21 2022, Neeraj Singh wrote:

> On Mon, Mar 21, 2022 at 8:04 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>>
>> > From: Neeraj Singh <neerajsi@microsoft.com>
>> >
>> > The update-index functionality is used internally by 'git stash push' to
>> > set up the internal stashed commit.
>> >
>> > This change enables bulk-checkin for update-index infrastructure to
>> > speed up adding new objects to the object database by leveraging the
>> > batch fsync functionality.
>> >
>> > There is some risk with this change, since under batch fsync, the object
>> > files will be in a tmp-objdir until update-index is complete.  This
>> > usage is unlikely, since any tool invoking update-index and expecting to
>> > see objects would have to synchronize with the update-index process
>> > after passing it a file path.
>> >
>> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
>> > ---
>> >  builtin/update-index.c | 6 ++++++
>> >  1 file changed, 6 insertions(+)
>> >
>> > diff --git a/builtin/update-index.c b/builtin/update-index.c
>> > index 75d646377cc..38e9d7e88cb 100644
>> > --- a/builtin/update-index.c
>> > +++ b/builtin/update-index.c
>> > @@ -5,6 +5,7 @@
>> >   */
>> >  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>> >  #include "cache.h"
>> > +#include "bulk-checkin.h"
>> >  #include "config.h"
>> >  #include "lockfile.h"
>> >  #include "quote.h"
>> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>> >
>> >       the_index.updated_skipworktree = 1;
>> >
>> > +     /* we might be adding many objects to the object database */
>> > +     plug_bulk_checkin();
>> > +
>>
>> Shouldn't this be after parse_options_start()?
>
> Does it make a difference?  Especially if we do the object dir creation lazily?

I think it won't matter for the machine, but it helps with readability
to keep code like this as close to where it's used as possible.

Close enough and we'd also spot the other bug I mentioned here,
i.e. that we're setting this up where we're not writing objects at all
:)


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 75d646377cc..38e9d7e88cb 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -5,6 +5,7 @@
>   */
>  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>  #include "cache.h"
> +#include "bulk-checkin.h"
>  #include "config.h"
>  #include "lockfile.h"
>  #include "quote.h"
> @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  
>  	the_index.updated_skipworktree = 1;
>  
> +	/* we might be adding many objects to the object database */
> +	plug_bulk_checkin();
> +
>  	/*
>  	 * Custom copy of parse_options() because we want to handle
>  	 * filename arguments as they come.
> @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  		strbuf_release(&buf);
>  	}
>  
> +	/* by now we must have added all of the new objects */
> +	unplug_bulk_checkin();

I understand read-from-stdin code path would be worth plugging, but
the list of paths on the command line?  How many of them would one
fit?

Of course, the feeder may be expecting for the objects to appear in
the object store as it feeds the paths and will be utterly broken by
this change, as you mentioned in the proposed log message.  The
existing plug/unplug will change the behaviour by making the objects
sent to the packfile available only after getting unplugged.  This
series makes it even worse by making loose objects also unavailable
until unplug is called.

So, it probably is safer and more sensible approach to introduce a
new command line option to allow the bulk checkin, and those who do
not care about the intermediate state to opt into the new feature.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 21, 2022 at 10:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/builtin/update-index.c b/builtin/update-index.c
> > index 75d646377cc..38e9d7e88cb 100644
> > --- a/builtin/update-index.c
> > +++ b/builtin/update-index.c
> > @@ -5,6 +5,7 @@
> >   */
> >  #define USE_THE_INDEX_COMPATIBILITY_MACROS
> >  #include "cache.h"
> > +#include "bulk-checkin.h"
> >  #include "config.h"
> >  #include "lockfile.h"
> >  #include "quote.h"
> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >
> >       the_index.updated_skipworktree = 1;
> >
> > +     /* we might be adding many objects to the object database */
> > +     plug_bulk_checkin();
> > +
> >       /*
> >        * Custom copy of parse_options() because we want to handle
> >        * filename arguments as they come.
> > @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >               strbuf_release(&buf);
> >       }
> >
> > +     /* by now we must have added all of the new objects */
> > +     unplug_bulk_checkin();
>
> I understand read-from-stdin code path would be worth plugging, but
> the list of paths on the command line?  How many of them would one
> fit?
>

do_reupdate could touch all the files in the index.  Also one can pass a
directory, and re-add all files under the directory.

> Of course, the feeder may be expecting for the objects to appear in
> the object store as it feeds the paths and will be utterly broken by
> this change, as you mentioned in the proposed log message.  The
> existing plug/unplug will change the behaviour by making the objects
> sent to the packfile available only after getting unplugged.  This
> series makes it even worse by making loose objects also unavailable
> until unplug is called.
>
> So, it probably is safer and more sensible approach to introduce a
> new command line option to allow the bulk checkin, and those who do
> not care about the intermediate state to opt into the new feature.
>

I don't believe this usage is likely today. How would the feeder know when
it can expect to find an object in the object directory after passing something
on stdin?  When fed via stdin, git-update-index will asynchronously add that
object to the object database, leaving no indication to the feeder of when it
actually happens, aside from it happening before the git-update-index process
terminates.  I used to have a comment here about the feeder being able to
parse the --verbose output to get feedback from git-update-index, which
would be quite tricky. I thought it was unnecessarily detailed.

Thanks,
Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static void end_odb_transaction_if_active(void)
> +{
> +	if (!odb_transaction_active)
> +		return;
> +
> +	end_odb_transaction();
> +	odb_transaction_active = 0;
> +}

>  __attribute__((format (printf, 1, 2)))
>  static void report(const char *fmt, ...)
>  {
> @@ -57,6 +68,16 @@ static void report(const char *fmt, ...)
>  	if (!verbose)
>  		return;
>  
> +	/*
> +	 * It is possible, though unlikely, that a caller
> +	 * could use the verbose output to synchronize with
> +	 * addition of objects to the object database, so
> +	 * unplug bulk checkin to make sure that future objects
> +	 * are immediately visible.
> +	 */
> +
> +	end_odb_transaction_if_active();
> +
>  	va_start(vp, fmt);
>  	vprintf(fmt, vp);
>  	putchar('\n');
> @@ -1116,6 +1137,13 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  	 */
>  	parse_options_start(&ctx, argc, argv, prefix,
>  			    options, PARSE_OPT_STOP_AT_NON_OPTION);
> +
> +	/*
> +	 * Allow the object layer to optimize adding multiple objects in
> +	 * a batch.
> +	 */
> +	begin_odb_transaction();
> +	odb_transaction_active = 1;

This looks strange.  Shouldn't begin/end pair be responsible for
knowing if there is a transaction active already?  For that matter,
didn't the original unplug in plug/unplug pair automatically turned
into no-op when it is already unplugged?

IOW, I am not sure end_if_active() should exist in the first place.
Shouldn't end_transaction() do that instead?


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Thu, Mar 24, 2022 at 11:18 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +static void end_odb_transaction_if_active(void)
> > +{
> > +     if (!odb_transaction_active)
> > +             return;
> > +
> > +     end_odb_transaction();
> > +     odb_transaction_active = 0;
> > +}
>
> >  __attribute__((format (printf, 1, 2)))
> >  static void report(const char *fmt, ...)
> >  {
> > @@ -57,6 +68,16 @@ static void report(const char *fmt, ...)
> >       if (!verbose)
> >               return;
> >
> > +     /*
> > +      * It is possible, though unlikely, that a caller
> > +      * could use the verbose output to synchronize with
> > +      * addition of objects to the object database, so
> > +      * unplug bulk checkin to make sure that future objects
> > +      * are immediately visible.
> > +      */
> > +
> > +     end_odb_transaction_if_active();
> > +
> >       va_start(vp, fmt);
> >       vprintf(fmt, vp);
> >       putchar('\n');
> > @@ -1116,6 +1137,13 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >        */
> >       parse_options_start(&ctx, argc, argv, prefix,
> >                           options, PARSE_OPT_STOP_AT_NON_OPTION);
> > +
> > +     /*
> > +      * Allow the object layer to optimize adding multiple objects in
> > +      * a batch.
> > +      */
> > +     begin_odb_transaction();
> > +     odb_transaction_active = 1;
>
> This looks strange.  Shouldn't begin/end pair be responsible for
> knowing if there is a transaction active already?  For that matter,
> didn't the original unplug in plug/unplug pair automatically turned
> into no-op when it is already unplugged?
>
> IOW, I am not sure end_if_active() should exist in the first place.
> Shouldn't end_transaction() do that instead?
>

Today there's an "assert(bulk_checkin_plugged)" in
end_odb_transaction. In principle we could just drop the assert and
allow a transaction to be ended multiple times.  But maybe in the long
run for composability we'd like to have nested callers to begin/end
transaction (e.g. we could have a nested transaction around writing
the cache tree to the ODB to minimize fsyncs there).  In that world,
having a subsystem not maintain a balanced pairing could be a problem.
An alternative API here could be to have an "flush_odb_transaction"
call to make the objects visible at this point.  Lastly, I could take
your original suggested approach of adding a new flag to update-index.
I preferred the unplug-on-verbose approach since it would
automatically optimize most callers to update-index that might exist
in the wild, without users having to change anything.

Thanks,
Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

>> IOW, I am not sure end_if_active() should exist in the first place.
>> Shouldn't end_transaction() do that instead?
>>
>
> Today there's an "assert(bulk_checkin_plugged)" in
> end_odb_transaction. In principle we could just drop the assert and
> allow a transaction to be ended multiple times.  But maybe in the long
> run for composability we'd like to have nested callers to begin/end
> transaction (e.g. we could have a nested transaction around writing
> the cache tree to the ODB to minimize fsyncs there).

I am not convinced that "transaction" is a good mental model for
this mechanism to begin with, in the sense that it is
not a bug or failure of the implementation if two or more operations
in the same <begin,end> bracket did not happen (or not happen)
atomically, or if 'begin' and 'end' were not properly nested.  With
the design getting more complex with things like tentative object
store that needs to be explicitly migrated after the outermost level
of end-transaction, we may end up _requiring_ that sufficient number
of 'end' must come once we issued 'begin', which I am not sure is
necessarily a good thing.

In any case, we aspire/envision to have a nested plug/unplug, I
think it is a good thing.  A helper for one subsystem may have its
large batch of operations inside plug/unplug pair, another help may
do the same, and the caller of these two helpers may want to say

	plug
		call helper A
			A does plug
			A does many things
			A does unplug
		call helper B
			B does plug
			B does many things
			B does unplug
	unplug

to "cancel" the unplug helper A and B has.

> In that world,
> having a subsystem not maintain a balanced pairing could be a problem.

And in such a world, you never want to have end-if-active to
implement what you are doing here, as you may end up being not
properly nested:

	begin
		begin
			do many things
			if some condtion
				end_if_active
			do more things
		end
	end

> An alternative API here could be to have an "flush_odb_transaction"
> call to make the objects visible at this point.

Yes, what you want is a forced-flush instead, I think.

So I suspect you'd want these three primitives, perhaps?

 * begin increments the nesting level
   - if outermost, you may have to do real "setup" things
   - otherwise, you may not have anything other than just counting
     the nesting level

 * flush implements unplug, fsync, etc. and does so immediately,
   even when plugged.

 * end decrements the nesting level
   - if outermost, you'd do "flush".
   - otherwise, you may only count the nesting level and do nothing else,
     but doing "flush" when you realize that you've queued too many
     is not a bug or a crime.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Thu, Mar 24, 2022 at 2:34 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> >> IOW, I am not sure end_if_active() should exist in the first place.
> >> Shouldn't end_transaction() do that instead?
> >>
> >
> > Today there's an "assert(bulk_checkin_plugged)" in
> > end_odb_transaction. In principle we could just drop the assert and
> > allow a transaction to be ended multiple times.  But maybe in the long
> > run for composability we'd like to have nested callers to begin/end
> > transaction (e.g. we could have a nested transaction around writing
> > the cache tree to the ODB to minimize fsyncs there).
>
> I am not convinced that "transaction" is a good mental model for
> this mechanism to begin with, in the sense that it is
> not a bug or failure of the implementation if two or more operations
> in the same <begin,end> bracket did not happen (or not happen)
> atomically, or if 'begin' and 'end' were not properly nested.  With
> the design getting more complex with things like tentative object
> store that needs to be explicitly migrated after the outermost level
> of end-transaction, we may end up _requiring_ that sufficient number
> of 'end' must come once we issued 'begin', which I am not sure is
> necessarily a good thing.

I don't love the tentative object store that keeps things invisble,
but that was the safest way to maintain the invariant that no
loose-object name appears in the ODB without durable contents.  I
think we want the "durability/ordering boundary" part of database
transactions without necessarily needing full abort/commit semantics.
As you say, we don't need full atomicity, but we do need ordering to
ensure that blobs are durable before trees pointing them, and so on up
the merkle chain.  The begin/end pairs help us defer the syncs
required for ordering to the end rather than pessimistically assuming
that every object write is the end.

> In any case, we aspire/envision to have a nested plug/unplug, I
> think it is a good thing.  A helper for one subsystem may have its
> large batch of operations inside plug/unplug pair, another help may
> do the same, and the caller of these two helpers may want to say
>
>         plug
>                 call helper A
>                         A does plug
>                         A does many things
>                         A does unplug
>                 call helper B
>                         B does plug
>                         B does many things
>                         B does unplug
>         unplug
>
> to "cancel" the unplug helper A and B has.
>
> > In that world,
> > having a subsystem not maintain a balanced pairing could be a problem.
>
> And in such a world, you never want to have end-if-active to
> implement what you are doing here, as you may end up being not
> properly nested:
>
>         begin
>                 begin
>                         do many things
>                         if some condtion
>                                 end_if_active
>                         do more things
>                 end
>         end
>
> > An alternative API here could be to have an "flush_odb_transaction"
> > call to make the objects visible at this point.
>
> Yes, what you want is a forced-flush instead, I think.
>
> So I suspect you'd want these three primitives, perhaps?
>
>  * begin increments the nesting level
>    - if outermost, you may have to do real "setup" things
>    - otherwise, you may not have anything other than just counting
>      the nesting level
>
>  * flush implements unplug, fsync, etc. and does so immediately,
>    even when plugged.
>
>  * end decrements the nesting level
>    - if outermost, you'd do "flush".
>    - otherwise, you may only count the nesting level and do nothing else,
>      but doing "flush" when you realize that you've queued too many
>      is not a bug or a crime.
>

Yes, I'll move in this direction. Thanks for the feedback.


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	/*
> +	 * Allow the object layer to optimize adding multiple objects in
> +	 * a batch.
> +	 */
> +	begin_odb_transaction();
>  	while (ctx.argc) {
>  		if (parseopt_state != PARSE_OPT_DONE)
>  			parseopt_state = parse_options_step(&ctx, options,
> @@ -1167,6 +1174,17 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  		the_index.version = preferred_index_format;
>  	}
>  
> +	/*
> +	 * It is possible, though unlikely, that a caller could use the verbose
> +	 * output to synchronize with addition of objects to the object
> +	 * database. The current implementation of ODB transactions leaves
> +	 * objects invisible while a transaction is active, so end the
> +	 * transaction here if verbose output is enabled.
> +	 */
> +
> +	if (verbose)
> +		end_odb_transaction();
> +
>  	if (read_from_stdin) {
>  		struct strbuf buf = STRBUF_INIT;
>  		struct strbuf unquoted = STRBUF_INIT;
> @@ -1190,6 +1208,12 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  		strbuf_release(&buf);
>  	}
>  
> +	/*
> +	 * By now we have added all of the new objects
> +	 */
> +	if (!verbose)
> +		end_odb_transaction();

If we had "flush" in addition to "begin" and "end", then we could,
instead of the above

    begin_transaction
	do things
    if condition:
	end_transaction
    loop:
	do thing
    if !condition:
	end_transaction


which is somewhat hard to follow and maintain, consider using a
different flow, which is

    begin_transaction
	do things
    loop:
	do thing
	if condition:
	    flush
    end_transaction

and that might make it easier to follow and maintain.  I am not 100%
sure if it is worth it, but I am leaning to say it would be.

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 30, 2022 at 10:52 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +     /*
> > +      * Allow the object layer to optimize adding multiple objects in
> > +      * a batch.
> > +      */
> > +     begin_odb_transaction();
> >       while (ctx.argc) {
> >               if (parseopt_state != PARSE_OPT_DONE)
> >                       parseopt_state = parse_options_step(&ctx, options,
> > @@ -1167,6 +1174,17 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >               the_index.version = preferred_index_format;
> >       }
> >
> > +     /*
> > +      * It is possible, though unlikely, that a caller could use the verbose
> > +      * output to synchronize with addition of objects to the object
> > +      * database. The current implementation of ODB transactions leaves
> > +      * objects invisible while a transaction is active, so end the
> > +      * transaction here if verbose output is enabled.
> > +      */
> > +
> > +     if (verbose)
> > +             end_odb_transaction();
> > +
> >       if (read_from_stdin) {
> >               struct strbuf buf = STRBUF_INIT;
> >               struct strbuf unquoted = STRBUF_INIT;
> > @@ -1190,6 +1208,12 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >               strbuf_release(&buf);
> >       }
> >
> > +     /*
> > +      * By now we have added all of the new objects
> > +      */
> > +     if (!verbose)
> > +             end_odb_transaction();
>
> If we had "flush" in addition to "begin" and "end", then we could,
> instead of the above
>
>     begin_transaction
>         do things
>     if condition:
>         end_transaction
>     loop:
>         do thing
>     if !condition:
>         end_transaction
>
>
> which is somewhat hard to follow and maintain, consider using a
> different flow, which is
>
>     begin_transaction
>         do things
>     loop:
>         do thing
>         if condition:
>             flush
>     end_transaction
>
> and that might make it easier to follow and maintain.  I am not 100%
> sure if it is worth it, but I am leaning to say it would be.
>
> Thanks.

I thought about this, but I was somewhat worried about the extra cost
of "flushing" versus just making things visible immediately if the
top-level update-index command knows it's going to need objects to be
visible.  I'll go ahead and implement flush in the next iteration.

#define USE_THE_INDEX_COMPATIBILITY_MACROS
#include "cache.h"
#include "bulk-checkin.h"
#include "config.h"
#include "lockfile.h"
#include "quote.h"
@@ -57,6 +58,14 @@ static void report(const char *fmt, ...)
if (!verbose)
return;

/*
* It is possible, though unlikely, that a caller could use the verbose
* output to synchronize with addition of objects to the object
* database. The current implementation of ODB transactions leaves
* objects invisible while a transaction is active, so flush the
* transaction here before reporting a change made by update-index.
*/
flush_odb_transaction();
va_start(vp, fmt);
vprintf(fmt, vp);
putchar('\n');
@@ -1116,6 +1125,12 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
*/
parse_options_start(&ctx, argc, argv, prefix,
options, PARSE_OPT_STOP_AT_NON_OPTION);

/*
* Allow the object layer to optimize adding multiple objects in
* a batch.
*/
begin_odb_transaction();
while (ctx.argc) {
if (parseopt_state != PARSE_OPT_DONE)
parseopt_state = parse_options_step(&ctx, options,
@@ -1190,6 +1205,11 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
strbuf_release(&buf);
}

/*
* By now we have added all of the new objects
*/
end_odb_transaction();

if (split_index > 0) {
if (git_config_get_split_index() == 0)
warning(_("core.splitIndex is set to false; "
116 changes: 98 additions & 18 deletions bulk-checkin.c
@@ -3,16 +3,21 @@
*/
#include "cache.h"
#include "bulk-checkin.h"
#include "lockfile.h"
#include "repository.h"
#include "csum-file.h"
#include "pack.h"
#include "strbuf.h"
#include "string-list.h"
#include "tmp-objdir.h"
#include "packfile.h"

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
>
> * Rename 'state' variable to 'bulk_checkin_state', since we will later
>   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
>   find in the debugger, since the name is more unique.
>
> * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
>   static variable. Doing this avoids resetting the variable in
>   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
>   seem to unintentionally disable the plugging functionality the first
>   time a new packfile must be created due to packfile size limits. While
>   disabling the plugging state only results in suboptimal behavior for
>   the current code, it would be fatal for the bulk-fsync functionality
>   later in this patch series.

Sorry, but I am confused.  The bulk-checkin infrastructure is there
so that we can send many little objects into a single packfile
instead of creating many little loose object files.  Everything we
throw at object-file.c::index_stream() will be concatenated into the
single packfile while we are "plugged" until we get "unplugged".

My understanding of what you are doing in this series is to still
create many little loose object files, but avoid the overhead of
having to fsync them individually.  And I am not sure how well the
original idea behind the bulk-checkin infrastructure to avoid
overhead of having to create many loose objects by creating a single
packfile (and presumably having to fsync at the end, but that is
just a single .pack file) with your goal of still creating many
loose object files but synching them more efficiently.

Is it just that the new feature is piggybacking on the existing bulk
checkin infrastructure, even though these two have nothing in
common?


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Tue, Mar 15, 2022 at 10:33 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Preparation for adding bulk-fsync to the bulk-checkin.c infrastructure.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Sorry, but I am confused.  The bulk-checkin infrastructure is there
> so that we can send many little objects into a single packfile
> instead of creating many little loose object files.  Everything we
> throw at object-file.c::index_stream() will be concatenated into the
> single packfile while we are "plugged" until we get "unplugged".
>

I noticed that you invented bulk-checkin back in 2011, but I don't think your
description matches what the code actually does.  index_bulk_checkin
is only called from index_stream, which is only called from index_fd. index_fd
goes down the index_bulk_checkin path for large files (512MB by default). It
looks like the effect of the 'plug/unplug' code is to allow multiple
large blobs to go into a single packfile rather than each one getting
its own separate packfile.

> My understanding of what you are doing in this series is to still
> create many little loose object files, but avoid the overhead of
> having to fsync them individually.  And I am not sure how well the
> original idea behind the bulk-checkin infrastructure to avoid
> overhead of having to create many loose objects by creating a single
> packfile (and presumably having to fsync at the end, but that is
> just a single .pack file) with your goal of still creating many
> loose object files but synching them more efficiently.
>
> Is it just the new feature is piggybacking on the existing bulk
> checkin infrastructure, even though these two have nothing in
> common?
>

I think my new usage is congruent with the existing API, which seems
to be about combining multiple add operations into a large transaction,
where we can do some cleanup operations once we're finished. In the
preexisting code, the transaction is about adding a bunch of large objects
to a single pack file (while leaving small objects loose), and then completing
the packfile when the adds are finished.

---
On a side note, I've also been thinking about how we could use a packfile
approach as an alternative means to achieve faster addition of many small
objects. It's essentially what you stated above, where we'd send our little
objects into a pack file. But to avoid frequent repacking overhead, we might
want to reuse the 'latest' packfile across multiple Git invocations by
appending objects to it, with an fsync on the file at the end.

We'd need sufficient padding between objects created by different Git
invocations to ensure that previously synced data doesn't get disturbed by
later operations.  We'd need to rewrite the pack indexes each time, but
that's at least derived metadata, so it doesn't need to be fsynced. To make
the pack indexes more incrementally-updatable, we might want to have the
fanout table be checksummed, with checksummed pointers to leaf blocks. If we
detect corruption during an index lookup, we could recreate the index from
the packfile.

Essentially the above proposal is to move away from storing loose objects in
the filesystem and instead to index the data within Git itself.

Thanks,
Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

> I think my new usage is congruent with the existing API, which seems
> to be about combining multiple add operations into a large transaction,
> where we can do some cleanup operations once we're finished. In the
> preexisting code, the transaction is about adding a bunch of large objects
> to a single pack file (while leaving small objects loose), and then completing
> the packfile when the adds are finished.

OK, so it was part me, and part a suboptimal presentation, I guess
;-)

Let me rephrase the idea to see if I got it right this time.

The bulk-checkin API has two interesting entry points, "plug" that
signals that we are about to repeat possibly many operations to add
new objects to the object store, and "unplug" that signals that we
are done such adding.  They are meant to serve as a hint for the
object layer to optimize its operation.

So far the only way the hint was used was that the logic that sends
an overly large object into a packfile (instead of storing it loose,
which leaves it subject to expensive repacking later) can shove more
than one such objects in the same packfile.

This series invents another use of the "plug"-"unplug" hint.  By
knowing that many loose object files are created and when the series
of object creation ended, we can avoid having to fsync each and
every one of them on certain filesystems and achieve the same
robustness.  The new "batch" option to core.fsyncmethod triggers
this mechanism.

Did I get it right, more-or-less?

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 16, 2022 at 9:14 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> > I think my new usage is congruent with the existing API, which seems
> > to be about combining multiple add operations into a large transaction,
> > where we can do some cleanup operations once we're finished. In the
> > preexisting code, the transaction is about adding a bunch of large objects
> > to a single pack file (while leaving small objects loose), and then completing
> > the packfile when the adds are finished.
>
> OK, so it was part me, and part a suboptimal presentation, I guess
> ;-)
>
> Let me rephrase the idea to see if I got it right this time.
>
> The bulk-checkin API has two interesting entry points, "plug" that
> signals that we are about to repeat possibly many operations to add
> new objects to the object store, and "unplug" that signals that we
> are done such adding.  They are meant to serve as a hint for the
> object layer to optimize its operation.
>
> So far the only way the hint was used was that the logic that sends
> an overly large object into a packfile (instead of storing it loose,
> which leaves it subject to expensive repacking later) can shove more
> than one such objects in the same packfile.
>
> This series invents another use of the "plug"-"unplug" hint.  By
> knowing that many loose object files are created and when the series
> of object creation ended, we can avoid having to fsync each and
> every one of them on certain filesystems and achieve the same
> robustness.  The new "batch" option to core.fsyncmethod triggers
> this mechanism.
>
> Did I get it right, more-or-less?

Yes, that's my understanding as well.

Thanks,
Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

>> Did I get it right, more-or-less?
>
> Yes, that's my understanding as well.

I guess what I wrote would make a useful material for early part of
the log message to help future developers.

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 16, 2022 at 11:10 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> >> Did I get it right, more-or-less?
> >
> > Yes, that's my understanding as well.
>
> I guess what I wrote would make a useful material for early part of
> the log message to help future developers.
>
> Thanks.

Will do.  I changed the commit message to explain the current
functionality of bulk-checkin and how it's similar to batched-fsync.


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Batched fsync will fit into bulk-checkin by taking advantage of the
> plug/unplug functionality to determine the appropriate time to fsync
> and make newly-added objects available in the primary object database.
>
> * Rename 'state' variable to 'bulk_checkin_state', since we will later
>   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
>   find in the debugger, since the name is more unique.
>
> * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
>   static variable. Doing this avoids resetting the variable in
>   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
>   seem to unintentionally disable the plugging functionality the first
>   time a new packfile must be created due to packfile size limits. While
>   disabling the plugging state only results in suboptimal behavior for
>   the current code, it would be fatal for the bulk-fsync functionality
>   later in this patch series.

Paraphrasing to make sure I understand your reasoning here...

In the "plug and then perform as many changes to the repository and
finally unplug" flow, before or after this series, the "perform"
step in the middle is unaware of which "bulk_checkin_state" instance
is being used to keep track of what is done to optimize by deferring
some operations until the "unplug" time.  So bulk_checkin_state is
not there to allow us to create multiple instances of it, pass them
around to different sequences of "plug, perform, unplug".  Each of
its members is inherently a singleton, so in the extreme, we could
turn these members into separate file-scope global variables if we
wanted to.  The "plugged" bit happens to be the only one getting
ejected by this patch, because it is inconvenient to "clear" other
members otherwise.

Is that what is going on?

If it is, I am mildly opposed to the flow of thought, from at least
two reasons.  It makes it hard for the next developer to decide if
the new members they are adding should be in or out of the struct.

More importantly, I think the call of finish_bulk_checkin() we make
in deflate_to_pack() you found (and there may possibly other places
that we do so; I didn't check) may not appear to be a bug in the
original context, but it already is a bug.  And when we change the
semantics of plug-unplug to be more "transaction-like", it becomes a
more serious bug, as you said.

There is NO reason to end the ongoing transaction there inside the
while() loop that tries to limit the size of the packfile being
used.  We may want to flush the "packfile part", which may have been
almost synonymous to the entirety of bulk_checkin_state, but as you
found out, the "plugged" bit is *outside* the "packfile part", and
that makes it a bug to call finish_bulk_checkin() from there.

We should add a new function, flush_bulk_checking_packfile(), to
flush only the packfile part of the bulk_checkin_state without
affecting other things---the "plugged" bit is the only one in the
current code before this series, but it does not have to stay to be
so.  When you start plugging the loose ref transactions, you may
find it handy (this is me handwaving) to have a list of refs that
you may have to do something at "unplug" time kept in the struct,
and you do not want deflate_to_pack() affecting the ongoing
"plugged" ref operations by calling finish_bulk_checkin() and
reinitializing that list, for example.

And then we should examine existing calls to finish_bulk_checkin()
and replace the ones that should not be finishing, i.e. the ones
that wanted "flush" but called "finish".


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 30, 2022 at 10:11 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Batched fsync will fit into bulk-checkin by taking advantage of the
> > plug/unplug functionality to determine the appropriate time to fsync
> > and make newly-added objects available in the primary object database.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Paraphrasing to make sure I understand your reasoning here...
>
> In the "plug and then perform as many changes to the repository and
> finally unplug" flow, before or after this series, the "perform"
> step in the middle is unaware of which "bulk_checkin_state" instance
> is being used to keep track of what is done to optimize by deferring
> some operations until the "unplug" time.  So bulk_checkin_state is
> not there to allow us to create multiple instances of it, pass them
> around to different sequences of "plug, perform, unplug".  Each of
> its members is inherently a singleton, so in the extreme, we could
> turn these members into separate file-scope global variables if we
> wanted to.  The "plugged" bit happens to be the only one getting
> ejected by this patch, because it is inconvenient to "clear" other
> members otherwise.
>
> Is that what is going on?
>

More or less.  The current state is all about creating a single
packfile for multiple large objects.  That packfile is a singleton
today (we could have an alternate implementation where there's a
separate packfile per thread in the future, so it's not inherent to
the API).  We want to do this if the top-level caller is okay with the
state being invisible until the "finish" call, and that is conveyed by
the "plugged" flag.

> If it is, I am mildly opposed to the flow of thought, from at least
> two reasons.  It makes it hard for the next developer to decide if
> the new members they are adding should be in or out of the struct.
>
> More importantly, I think the call of finish_bulk_checkin() we make
> in deflate_to_pack() you found (and there may possibly other places
> that we do so; I didn't check) may not appear to be a bug in the
> original context, but it already is a bug.  And when we change the
> semantics of plug-unplug to be more "transaction-like", it becomes a
> more serious bug, as you said.
>
> There is NO reason to end the ongoing transaction there inside the
> while() loop that tries to limit the size of the packfile being
> used.  We may want to flush the "packfile part", which may have been
> almost synonymous to the entirety of bulk_checkin_state, but as you
> found out, the "plugged" bit is *outside* the "packfile part", and
> that makes it a bug to call finish_bulk_checkin() from there.
>
> We should add a new function, flush_bulk_checking_packfile(), to
> flush only the packfile part of the bulk_checkin_state without
> affecting other things---the "plugged" bit is the only one in the
> current code before this series, but it does not have to stay to be
> so

I'm happy to rename the packfile related stuff to end with _packfile
to make it clear that all of that state and functionality is related
to batching of packfile additions.
So from this patch: s/bulk_checkin_state/bulk_checkin_packfile and
s/finish_bulk_checkin/finish_bulk_checkin_packfile.

My new state will be bulk_fsync_* (as it is already).  Any future
ODB-related state can go here too (I'm imagining a future
log-structured 'new objects pack' that we can append to for adding
small objects, similar to the bulk_checkin_packfile but allowing
appends from multiple git invocations).

> When you start plugging the loose ref transactions, you may
> find it handy (this is me handwaving) to have a list of refs that
> you may have to do something at "unplug" time kept in the struct,
> and you do not want deflate_to_pack() affecting the ongoing
> "plugged" ref operations by calling finish_bulk_checkin() and
> reinitializing that list, for example.
>

I don't believe ref transactions will go through this part of the
infrastructure.  Refs already have a good transaction system (that
partly inspired this rebranding, after I saw how Peter implemented
batch ref fsync).  I expect this area will remain all about the ODB as
a subsystem that can enlist in a larger repo->transaction.  So a
top-level Git command might initiate a repo transaction, which would
internally initiate an ODB transaction, index transaction, and ref
transaction. The repo transaction would support flushing each of the
subtransactions with an optimal number of fsyncs.

> And then we should examine existing calls to finish_bulk_checkin()
> and replace the ones that should not be finishing, i.e. the ones
> that wanted "flush" but called "finish".

Sure.  I can fix this, which will only change this file.  The only
case of "finishing" would be in unplug_bulk_checkin /
end_odb_transaction.

Thanks,
Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

>> We should add a new function, flush_bulk_checking_packfile(), to
>> flush only the packfile part of the bulk_checkin_state without
>> affecting other things---the "plugged" bit is the only one in the
>> current code before this series, but it does not have to stay to be
>> so
>
> I'm happy to rename the packfile related stuff to end with _packfile
> to make it clear that all of that state and functionality is related
> to batching of packfile additions.

I do not care about names, though.  If you took that I hinted any
such change, sorry about that.  _state is fine as-is.

I do care about not ejecting plugged out of the structure and
instead keeping them together, with proper way to flush the part
that deflate_to_pack() wants to flush, instead of abusing the
"finish".

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 30, 2022 at 1:24 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> >> We should add a new function, flush_bulk_checking_packfile(), to
> >> flush only the packfile part of the bulk_checkin_state without
> >> affecting other things---the "plugged" bit is the only one in the
> >> current code before this series, but it does not have to stay to be
> >> so
> >
> > I'm happy to rename the packfile related stuff to end with _packfile
> > to make it clear that all of that state and functionality is related
> > to batching of packfile additions.
>
> I do not care about names, though.  If you took that I hinted any
> such change, sorry about that.  _state is fine as-is.
>
> I do care about not ejecting plugged out of the structure and
> instead keeping them together, with proper way to flush the part
> that deflate_to_pack() wants to flush, instead of abusing the
> "finish".
>
> Thanks.

Just to understand your feedback better, is it a problem to separate
the state of each separate "thing" under ODB transactions into
separate file-scope global(s)?  In this series I declared the fsync
state as completely separate from the packfile state.  That's why I
was thinking of it as more of a naming problem, since the remaining
state aside from the plugged boolean is entirely packfile related.

My argument in favor of having separate file-scoped variables for each
'pluggable thing' would be that future implementations can evolve
separately without authors first having to disentangle a single
struct.

Thanks,
Neeraj


On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

> Just to understand your feedback better, is it a problem to separate
> the state of each separate "thing" under ODB transactions into
> separate file-scope global(s)?  In this series I declared the fsync
> state as completely separate from the packfile state.  That's why I
> was thinking of it as more of a naming problem, since the remaining
> state aside from the plugged boolean is entirely packfile related.

Ahh, sorry, my mistake.

I somehow thought that you would be making the existing "struct
bulk_checkin_state" infrastructure to cover not just the object
store but much more, perhaps because I partly mistook the motivation
to rename the structure (thinking again about it, since "checkin" is
the act of adding new objects to the object database from outside
sources (either from the working tree using "git add" command, or
from other sources using unpack-objects), the original name was
already fine to signal that it was about the object database, and
the need to rename it sounded like we were going to do much more
than the object database behind my head).

> My argument in favor of having separate file-scoped variables for each
> 'pluggable thing' would be that future implementations can evolve
> separately without authors first having to disentangle a single
> struct.

That is fine.  Would the trigger to "plug" and "unplug" also be
independent?  As you said elsewhere, the part to harden refs can
piggyback on the existing ref-transaction infrastructure.  I do not
know offhand what things other than loose objects that want "plug"
and "unplug" semantics, but if there are, are we going to have type
specific begin- and end-transaction?

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Thu, Mar 31, 2022 at 10:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> > Just to understand your feedback better, is it a problem to separate
> > the state of each separate "thing" under ODB transactions into
> > separate file-scope global(s)?  In this series I declared the fsync
> > state as completely separate from the packfile state.  That's why I
> > was thinking of it as more of a naming problem, since the remaining
> > state aside from the plugged boolean is entirely packfile related.
>
> Ahh, sorry, my mistake.
>
> I somehow thought that you would be making the existing "struct
> bulk_checkin_state" infrastructure to cover not just the object
> store but much more, perhaps because I partly mistook the motivation
> to rename the structure (thinking again about it, since "checkin" is
> the act of adding new objects to the object database from outside
> sources (either from the working tree using "git add" command, or
> from other sources using unpack-objects), the original name was
> already fine to signal that it was about the object database, and
> the need to rename it sounded like we were going to do much more
> than the object database behind my head).
>
> > My argument in favor of having separate file-scoped variables for each
> > 'pluggable thing' would be that future implementations can evolve
> > separately without authors first having to disentangle a single
> > struct.
>
> That is fine.  Would the trigger to "plug" and "unplug" also be
> independent?  As you said elsewhere, the part to harden refs can
> piggyback on the existing ref-transaction infrastructure.  I do not
> know offhand what things other than loose objects that want "plug"
> and "unplug" semantics, but if there are, are we going to have type
> specific begin- and end-transaction?
>

With regards to bulk-checkin.h, I believe for simplicity of interface
to the callers, there should be a single pair of APIs for plug or
unplug of the entire ODB regardless of what optimizations happen under
the covers.  For eventual repo-wide transactions, there should be a
single API to initiate a transaction and a single one to commit/abort
the transaction at the end.  We may still also want a flush API so
that we can make the repo state consistent prior to executing hooks or
doing something else where an outside process needs consistent repo
state.

Thanks,
Neeraj

#include "object-store.h"

static struct bulk_checkin_state {
unsigned plugged:1;
static int odb_transaction_nesting;

static struct tmp_objdir *bulk_fsync_objdir;

static struct bulk_checkin_packfile {
char *pack_tmp_name;
struct hashfile *f;
off_t offset;
@@ -21,7 +26,7 @@ static struct bulk_checkin_state {
struct pack_idx_entry **written;
uint32_t alloc_written;
uint32_t nr_written;
} state;
} bulk_checkin_packfile;

static void finish_tmp_packfile(struct strbuf *basename,
const char *pack_tmp_name,
@@ -39,7 +44,7 @@ static void finish_tmp_packfile(struct strbuf *basename,
free(idx_tmp_name);
}

static void finish_bulk_checkin(struct bulk_checkin_state *state)
static void flush_bulk_checkin_packfile(struct bulk_checkin_packfile *state)
{
unsigned char hash[GIT_MAX_RAWSZ];
struct strbuf packname = STRBUF_INIT;
@@ -80,7 +85,41 @@ static void finish_bulk_checkin(struct bulk_checkin_state *state)
reprepare_packed_git(the_repository);
}

static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
/*
* Cleanup after batch-mode fsync_object_files.
*/
static void flush_batch_fsync(void)
{
struct strbuf temp_path = STRBUF_INIT;
struct tempfile *temp;

if (!bulk_fsync_objdir)
return;

/*
* Issue a full hardware flush against a temporary file to ensure
* that all objects are durable before any renames occur. The code in
* fsync_loose_object_bulk_checkin has already issued a writeout
* request, but it has not flushed any writeback cache in the storage
* hardware or any filesystem logs. This fsync call acts as a barrier
* to ensure that the data in each new object file is durable before
* the final name is visible.
*/
strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
temp = xmks_tempfile(temp_path.buf);
fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
delete_tempfile(&temp);
strbuf_release(&temp_path);

/*
* Make the object files visible in the primary ODB after their data is
* fully durable.
*/
tmp_objdir_migrate(bulk_fsync_objdir);
bulk_fsync_objdir = NULL;
}

static int already_written(struct bulk_checkin_packfile *state, struct object_id *oid)
{
int i;

@@ -112,7 +151,7 @@ static int already_written(struct bulk_checkin_state *state, struct object_id *o
* status before calling us just in case we ask it to call us again
* with a new pack.
*/
static int stream_to_pack(struct bulk_checkin_state *state,
static int stream_to_pack(struct bulk_checkin_packfile *state,
git_hash_ctx *ctx, off_t *already_hashed_to,
int fd, size_t size, enum object_type type,
const char *path, unsigned flags)
@@ -189,7 +228,7 @@ static int stream_to_pack(struct bulk_checkin_state *state,
}

/* Lazily create backing packfile for the state */
static void prepare_to_stream(struct bulk_checkin_state *state,
static void prepare_to_stream(struct bulk_checkin_packfile *state,
unsigned flags)
{
if (!(flags & HASH_WRITE_OBJECT) || state->f)
@@ -204,7 +243,7 @@ static void prepare_to_stream(struct bulk_checkin_state *state,
die_errno("unable to write pack header");
}

static int deflate_to_pack(struct bulk_checkin_state *state,
static int deflate_to_pack(struct bulk_checkin_packfile *state,
struct object_id *result_oid,
int fd, size_t size,
enum object_type type, const char *path,
@@ -251,7 +290,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
BUG("should not happen");
hashfile_truncate(state->f, &checkpoint);
state->offset = checkpoint.offset;
finish_bulk_checkin(state);
flush_bulk_checkin_packfile(state);
if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
return error("cannot seek back");
}
@@ -274,25 +313,66 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
return 0;
}

void prepare_loose_object_bulk_checkin(void)
{
/*
* We lazily create the temporary object directory
* the first time an object might be added, since
* callers may not know whether any objects will be
* added at the time they call begin_odb_transaction.
*/
if (!odb_transaction_nesting || bulk_fsync_objdir)
return;

bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
if (bulk_fsync_objdir)
tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
}

void fsync_loose_object_bulk_checkin(int fd, const char *filename)
{
/*
* If we have an active ODB transaction, we issue a call that
* cleans the filesystem page cache but avoids a hardware flush
* command. Later on we will issue a single hardware flush
* before as part of flush_batch_fsync.
*/
if (!bulk_fsync_objdir ||
git_fsync(fd, FSYNC_WRITEOUT_ONLY) < 0) {
fsync_or_die(fd, filename);
}
}

int index_bulk_checkin(struct object_id *oid,
int fd, size_t size, enum object_type type,
const char *path, unsigned flags)
{
int status = deflate_to_pack(&state, oid, fd, size, type,
int status = deflate_to_pack(&bulk_checkin_packfile, oid, fd, size, type,
path, flags);
if (!state.plugged)
finish_bulk_checkin(&state);
if (!odb_transaction_nesting)
flush_bulk_checkin_packfile(&bulk_checkin_packfile);
return status;
}

void plug_bulk_checkin(void)
void begin_odb_transaction(void)
{
state.plugged = 1;
odb_transaction_nesting += 1;
}

void unplug_bulk_checkin(void)
void flush_odb_transaction(void)
{
state.plugged = 0;
if (state.f)
finish_bulk_checkin(&state);
flush_batch_fsync();
flush_bulk_checkin_packfile(&bulk_checkin_packfile);
}

void end_odb_transaction(void)
{
odb_transaction_nesting -= 1;
if (odb_transaction_nesting < 0)
BUG("Unbalanced ODB transaction nesting");

if (odb_transaction_nesting)
return;

flush_odb_transaction();
}
27 changes: 25 additions & 2 deletions bulk-checkin.h
@@ -6,11 +6,34 @@

#include "cache.h"

void prepare_loose_object_bulk_checkin(void);
void fsync_loose_object_bulk_checkin(int fd, const char *filename);

int index_bulk_checkin(struct object_id *oid,
int fd, size_t size, enum object_type type,
const char *path, unsigned flags);

void plug_bulk_checkin(void);
void unplug_bulk_checkin(void);
/*
* Tell the object database to optimize for adding
* multiple objects. end_odb_transaction must be called
* to make new objects visible. Transactions can be nested,
* and objects are only visible after the outermost transaction
* is complete or the transaction is flushed.
*/
void begin_odb_transaction(void);

/*
* Make any objects that are currently part of a pending object
* database transaction visible. It is valid to call this function
* even if no transaction is active.
*/
void flush_odb_transaction(void);

/*
* Tell the object database to make any objects from the
* current transaction visible if this is the final nested
* transaction.
*/
void end_odb_transaction(void);

#endif
3 changes: 3 additions & 0 deletions cache-tree.c
@@ -3,6 +3,7 @@
#include "tree.h"

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Take advantage of the odb transaction infrastructure around writing the
> cached tree to the object database.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  cache-tree.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 6752f69d515..8c5e8822716 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -3,6 +3,7 @@
>  #include "tree.h"
>  #include "tree-walk.h"
>  #include "cache-tree.h"
> +#include "bulk-checkin.h"
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> @@ -474,8 +475,10 @@ int cache_tree_update(struct index_state *istate, int flags)
>  
>  	trace_performance_enter();
>  	trace2_region_enter("cache_tree", "update", the_repository);

There is no I/O in update_one() when the WRITE_TREE_DRY_RUN bit is
set, so we _could_ optimize the begin/end away with

	if (!(flags & WRITE_TREE_DRY_RUN))
		begin_odb_transaction()

> +	begin_odb_transaction();
>  	i = update_one(istate->cache_tree, istate->cache, istate->cache_nr,
>  		       "", 0, &skip, flags);
> +	end_odb_transaction();
>  	trace2_region_leave("cache_tree", "update", the_repository);
>  	trace_performance_leave("cache_tree_update");
>  	if (i < 0)

I do not know if that is worth it.  If we do not do any object
creation inside begin/end, we don't even create the temporary object
directory and there is nothing we need to do when we "unplug".  So
this would be fine as-is, but I may be overlooking something, so I
thought I'd mention it for completeness.

Thanks.


On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Mar 30, 2022 at 10:47 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Take advantage of the odb transaction infrastructure around writing the
> > cached tree to the object database.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  cache-tree.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/cache-tree.c b/cache-tree.c
> > index 6752f69d515..8c5e8822716 100644
> > --- a/cache-tree.c
> > +++ b/cache-tree.c
> > @@ -3,6 +3,7 @@
> >  #include "tree.h"
> >  #include "tree-walk.h"
> >  #include "cache-tree.h"
> > +#include "bulk-checkin.h"
> >  #include "object-store.h"
> >  #include "replace-object.h"
> >  #include "promisor-remote.h"
> > @@ -474,8 +475,10 @@ int cache_tree_update(struct index_state *istate, int flags)
> >
> >       trace_performance_enter();
> >       trace2_region_enter("cache_tree", "update", the_repository);
>
> There is no I/O in update_one() when the WRITE_TREE_DRY_RUN bit is
> set, so we _could_ optimize the begin/end away with
>
>         if (!(flags & WRITE_TREE_DRY_RUN))
>                 begin_odb_transaction()
>
> > +     begin_odb_transaction();
> >       i = update_one(istate->cache_tree, istate->cache, istate->cache_nr,
> >                      "", 0, &skip, flags);
> > +     end_odb_transaction();
> >       trace2_region_leave("cache_tree", "update", the_repository);
> >       trace_performance_leave("cache_tree_update");
> >       if (i < 0)
>
> I do not know if that is worth it.  If we do not do any object
> creation inside begin/end, we don't even create the temporary object
> directory and there is nothing we need to do when we "unplug".  So
> this would be fine as-is, but I may be overlooking something, so I
> thought I'd mention it for completeness.
>

Yes, with the current series, beginning and ending a transaction will
just manipulate a few global variables in bulk-checkin.c unless there
is something real to flush.

Thanks,
Neeraj

#include "tree-walk.h"
#include "cache-tree.h"
#include "bulk-checkin.h"
#include "object-store.h"
#include "replace-object.h"
#include "promisor-remote.h"
@@ -474,8 +475,10 @@ int cache_tree_update(struct index_state *istate, int flags)

trace_performance_enter();
trace2_region_enter("cache_tree", "update", the_repository);
begin_odb_transaction();
i = update_one(istate->cache_tree, istate->cache, istate->cache_nr,
"", 0, &skip, flags);
end_odb_transaction();
trace2_region_leave("cache_tree", "update", the_repository);
trace_performance_leave("cache_tree_update");
if (i < 0)