Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A design for future-proofing fsync() configuration #1093

Closed

Conversation

neerajsi-msft
Copy link

@neerajsi-msft neerajsi-msft commented Dec 4, 2021

This is an implementation of an extensible configuration mechanism for fsyncing persistent components of a repo.

The main goals are to separate the "what" to sync from the "how".
There are now two settings:
core.fsync - Control the 'what', including the index.
core.fsyncMethod - Control the 'how'. Currently we support writeout-only and full fsync.

Syncing of refs can be layered on top of core.fsync. And batch mode will be layered on core.fsyncMethod. Once this series reaches 'seen', I'll submit ns/batched-fsync to introduce batch mode. Please see #1134.

core.fsyncObjectfiles is removed and we will issue a deprecation warning if it's seen.

I'd like to get agreement on this direction before submitting batch mode to the list. The batch mode series is available to view at

Please see [1], [2], and [3] for discussions that led to this series.

After this change, new persistent data files added to the repo will need to be added to the fsync_component enum and documented in the Documentation/config/core.txt text.

V6 changes:

  • Move only the windows csprng includes into wrapper.c rather than all of them. This fixes the specific build issue due to broken Windows headers. [6]
  • Split the configuration parsing of core.fsync from the mechanism to focus the review.
  • Incorporate Patrick's patch at [7] into the core.fsync mechanism patch.
  • Pick the stricter one of core.fsyncObjectFiles and (fsync_components & FSYNC_COMPONENT_LOOSE_OBJECTS), to respect the older setting.
  • Issue a deprecation warning but keep parsing and honoring core.fsyncObjectFiles.
  • Change configuration parsing of core.fsync to always start with the platform default. none resets to the empty set. The comma separated list implies a set without regards to ordering now. This follows Junio's suggestion in [8].
  • Change the documentation of the core.fsync option to reflect the way the new parsing code works.
  • The patch 7 and 8 of Patrick's series at [7] can be cherry-picked after being applied to ns/core-fsyncmethod.

V5 changes:

  • Rebase onto main at c216290
  • Add a patch to move CSPRNG platform includes to wrapper.c. This avoids build errors in compat/win32/flush.c and other files.
  • Move the documentation and aggregate options to the final patch in the series.
  • Define new aggregate options and guidance in line with Junio's suggestion to present the user with 'levels of safety' rather than a morass of detailed options.

V4 changes:

  • Rebase onto master at b23dac9.
  • Add a comment to write_pack_file indicating why we don't fsync when writing to stdout.
  • I kept the configuration schema as-is rather than switching to multi-value. The thinking here is that a stateless last-one-wins config schema (comma separated) will make it easier to achieve some holistic self-consistent fsync configuration for a particular repo.

V3 changes:

  • Remove relative path from git-compat-util.h include [4].
  • Updated newly added warning texts to have more context for localization [4].
  • Fixed tab spacing in enum fsync_action
  • Moved the fsync looping out to a helper and do it consistently. [4]
  • Changed commit description to use camelCase for config names. [5]
  • Add an optional fourth patch with derived-metadata so that the user can exclude a forward-compatible set of things that should be recomputable given existing data.

V2 changes:

  • Updated the documentation for core.fsyncmethod to be less certain. writeout-only probably does not do the right thing on Linux.
  • Split out the core.fsync=index change into its own commit.
  • Rename REPO_COMPONENT to FSYNC_COMPONENT. This is really specific to fsyncing, so the name should reflect that.
  • Re-add missing Makefile change for SYNC_FILE_RANGE.
  • Tested writeout-only mode, index syncing, and general config settings.

[1] https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
[2] https://lore.kernel.org/git/dd65718814011eb93ccc4428f9882e0f025224a6.1636029491.git.ps@pks.im/
[3] https://lore.kernel.org/git/pull.1076.git.git.1629856292.gitgitgadget@gmail.com/
[4] https://lore.kernel.org/git/CANQDOdf8C4-haK9=Q_J4Cid8bQALnmGDm=SvatRbaVf+tkzqLw@mail.gmail.com/
[5] https://lore.kernel.org/git/211207.861r2opplg.gmgdl@evledraar.gmail.com/
[6] https://lore.kernel.org/git/CANQDOdfZbOHZQt9Ah0t1AamTO2T7Gq0tmWX1jLqL6njE0LF6DA@mail.gmail.com/
[7] https://lore.kernel.org/git/50e39f698a7c0cc06d3bc060e6dbc539ea693241.1646905589.git.ps@pks.im/
[8] https://lore.kernel.org/git/xmqqk0d1cxsv.fsf@gitster.g/

cc: rsbecker@nexbridge.com
cc: bagasdotme@gmail.com
cc: newren@gmail.com
cc: avarab@gmail.com
cc: nksingh85@gmail.com
cc: ps@pks.im
cc: sandals@crustytoothpaste.net
cc: Johannes Schindelin Johannes.Schindelin@gmx.de
cc: SZEDER Gábor szeder.dev@gmail.com
cc: Jiang Xin worldhello.net@gmail.com

@neerajsi-msft
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 4, 2021

Submitted as pull.1093.git.1638588503.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1093/neerajsi-msft/ns/core-fsync-v1

To fetch this version to local tag pr-1093/neerajsi-msft/ns/core-fsync-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1093/neerajsi-msft/ns/core-fsync-v1

@neerajsi-msft neerajsi-msft force-pushed the ns/core-fsync branch 5 times, most recently from 53b8d12 to eb38385 Compare December 6, 2021 23:02
@neerajsi-msft
Copy link
Author

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 7, 2021

Preview email sent as pull.1093.v2.git.1638841190.gitgitgadget@gmail.com

@neerajsi-msft
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 7, 2021

Submitted as pull.1093.v2.git.1638845211.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1093/neerajsi-msft/ns/core-fsync-v2

To fetch this version to local tag pr-1093/neerajsi-msft/ns/core-fsync-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1093/neerajsi-msft/ns/core-fsync-v2

@@ -547,6 +547,15 @@ core.whitespace::
is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):


--FRW3ky2b/hqIXfuW
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 07, 2021 at 02:46:49AM +0000, Neeraj Singh via GitGitGadget wro=
te:
> From: Neeraj Singh <neerajsi@microsoft.com>
[snip]
> --- a/compat/mingw.h
> +++ b/compat/mingw.h
> @@ -329,6 +329,9 @@ int mingw_getpagesize(void);
>  #define getpagesize mingw_getpagesize
>  #endif
> =20
> +int win32_fsync_no_flush(int fd);
> +#define fsync_no_flush win32_fsync_no_flush
> +
>  struct rlimit {
>  	unsigned int rlim_cur;
>  };
> diff --git a/compat/win32/flush.c b/compat/win32/flush.c
> new file mode 100644
> index 00000000000..75324c24ee7
> --- /dev/null
> +++ b/compat/win32/flush.c
> @@ -0,0 +1,28 @@
> +#include "../../git-compat-util.h"
> +#include <winternl.h>
> +#include "lazyload.h"
> +
> +int win32_fsync_no_flush(int fd)
> +{
> +       IO_STATUS_BLOCK io_status;
> +
> +#define FLUSH_FLAGS_FILE_DATA_ONLY 1
> +
> +       DECLARE_PROC_ADDR(ntdll.dll, NTSTATUS, NtFlushBuffersFileEx,
> +			 HANDLE FileHandle, ULONG Flags, PVOID Parameters, ULONG ParameterSiz=
e,
> +			 PIO_STATUS_BLOCK IoStatusBlock);
> +
> +       if (!INIT_PROC_ADDR(NtFlushBuffersFileEx)) {
> +		errno =3D ENOSYS;
> +		return -1;
> +       }

I'm wondering whether it would make sense to fall back to fsync(3P) in
case we cannot use writeout-only, but I see that were doing essentially
that in `fsync_or_die()`. There is no indicator to the user though that
writeout-only doesn't work -- do we want to print a one-time warning?

> +       memset(&io_status, 0, sizeof(io_status));
> +       if (NtFlushBuffersFileEx((HANDLE)_get_osfhandle(fd), FLUSH_FLAGS_=
FILE_DATA_ONLY,
> +				NULL, 0, &io_status)) {
> +		errno =3D EINVAL;
> +		return -1;
> +       }
> +
> +       return 0;
> +}

[snip]
> diff --git a/wrapper.c b/wrapper.c
> index 36e12119d76..1c5f2c87791 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -546,6 +546,62 @@ int xmkstemp_mode(char *filename_template, int mode)
>  	return fd;
>  }
> =20
> +int git_fsync(int fd, enum fsync_action action)
> +{
> +	switch (action) {
> +	case FSYNC_WRITEOUT_ONLY:
> +
> +#ifdef __APPLE__
> +		/*
> +		 * on macOS, fsync just causes filesystem cache writeback but does not
> +		 * flush hardware caches.
> +		 */
> +		return fsync(fd);

Below we're looping around `EINTR` -- are Apple systems never returning
it?

Patrick

> +#endif
> +
> +#ifdef HAVE_SYNC_FILE_RANGE
> +		/*
> +		 * On linux 2.6.17 and above, sync_file_range is the way to issue
> +		 * a writeback without a hardware flush. An offset of 0 and size of 0
> +		 * indicates writeout of the entire file and the wait flags ensure tha=
t all
> +		 * dirty data is written to the disk (potentially in a disk-side cache)
> +		 * before we continue.
> +		 */
> +
> +		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> +						 SYNC_FILE_RANGE_WRITE |
> +						 SYNC_FILE_RANGE_WAIT_AFTER);
> +#endif
> +
> +#ifdef fsync_no_flush
> +		return fsync_no_flush(fd);
> +#endif
> +
> +		errno =3D ENOSYS;
> +		return -1;
> +
> +	case FSYNC_HARDWARE_FLUSH:
> +		/*
> +		 * On some platforms fsync may return EINTR. Try again in this
> +		 * case, since callers asking for a hardware flush may die if
> +		 * this function returns an error.
> +		 */
> +		for (;;) {
> +			int err;
> +#ifdef __APPLE__
> +			err =3D fcntl(fd, F_FULLFSYNC);
> +#else
> +			err =3D fsync(fd);
> +#endif
> +			if (err >=3D 0 || errno !=3D EINTR)
> +				return err;
> +		}
> +
> +	default:
> +		BUG("unexpected git_fsync(%d) call", action);
> +	}
> +}
> +
>  static int warn_if_unremovable(const char *op, const char *file, int rc)
>  {
>  	int err;
> diff --git a/write-or-die.c b/write-or-die.c
> index 0b1ec8190b6..0702acdd5e8 100644
> --- a/write-or-die.c
> +++ b/write-or-die.c
> @@ -57,10 +57,12 @@ void fprintf_or_die(FILE *f, const char *fmt, ...)
> =20
>  void fsync_or_die(int fd, const char *msg)
>  {
> -	while (fsync(fd) < 0) {
> -		if (errno !=3D EINTR)
> -			die_errno("fsync error on '%s'", msg);
> -	}
> +	if (fsync_method =3D=3D FSYNC_METHOD_WRITEOUT_ONLY &&
> +	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >=3D 0)
> +		return;
> +
> +	if (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0)
> +		die_errno("fsync error on '%s'", msg);
>  }
> =20
>  void write_or_die(int fd, const void *buf, size_t count)
> --=20
> gitgitgadget
>=20

--FRW3ky2b/hqIXfuW
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEF9hrgiFbCdvenl/rVbJhu7ckPpQFAmGvSSYACgkQVbJhu7ck
PpSusRAAkSCjo70fipT9enLRntvP2bsxSqddv23zIdvhQMzsThePBBlvA8+6lK6H
VT6LHKUpeSeknLmMNOOF1+JX38aKOqn1OGFTviNchPDrsJYxM6wRUGrMoEpD1yOJ
fol0XQ74XC9v5eW81Fc4XEc6rsXMfByRT5Iyp3e5mp+plekzspUh4+i4v/FdrE4e
eY7dRZOD009h2sxXqlHQVUjPZ1O2eIUGKjdHX1T1T/4/ctuv2zkOR1dHpAChNKRe
KxKjTDWiR0sLZjrvwpuDu7Wyo+z5bDg3t8NFthZKYmmvf+R40GOOlXN8eVDT6kiU
zF6h3L4+sNCJl13KWKIaHLSVech4Xst0c4cEfVyE9Z7d7MDBRuRBBJbEEm1znLJm
54LfkAFIvGArfUFSyecwgL+4k7PAncrn3uIR5Qf0qSgIsGdYI8cbnl8N3w9zOdNM
GMH6VafsDwPQG5+jGt4KpQIOTq6KsoYaTkyntvCTkgRNyg0FneEj/HRFnwuGx8h4
P7q7VMhgD3Q7YFxLWyYNQMxyCgPdMNhhcw83CfTMgubcGCHr3Ubw0xYrFIt7wtfV
RDmrtrCculP94prD4tJwpvkkvbtVXBYIQDu7lDgk50yMvKSYfZu0FqelaiuMwKzX
u6RTZ56FHEYYeuuJoPJk4zulJOpSiu50ETYUsloS15kSrYfxpMQ=
=oHAu
-----END PGP SIGNATURE-----

--FRW3ky2b/hqIXfuW--

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Dec 07 2021, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> On Tue, Dec 07, 2021 at 02:46:49AM +0000, Neeraj Singh via GitGitGadget wrote:
>> From: Neeraj Singh <neerajsi@microsoft.com>
> [...]
> [snip]
>> diff --git a/wrapper.c b/wrapper.c
>> index 36e12119d76..1c5f2c87791 100644
>> --- a/wrapper.c
>> +++ b/wrapper.c
>> @@ -546,6 +546,62 @@ int xmkstemp_mode(char *filename_template, int mode)
>>  	return fd;
>>  }
>>  
>> +int git_fsync(int fd, enum fsync_action action)
>> +{
>> +	switch (action) {
>> +	case FSYNC_WRITEOUT_ONLY:
>> +
>> +#ifdef __APPLE__
>> +		/*
>> +		 * on macOS, fsync just causes filesystem cache writeback but does not
>> +		 * flush hardware caches.
>> +		 */
>> +		return fsync(fd);
>
> Below we're looping around `EINTR` -- are Apple systems never returning
> it?

I think so per cccdfd22436 (fsync(): be prepared to see EINTR,
2021-06-04), but I'm not sure, but in any case it would make sense for
this to just call the same loop we've been calling elsewhere. It doesn't
seem to hurt, so we can just do that part in the portable portion of the
code.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Tue, Dec 7, 2021 at 3:45 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Tue, Dec 07, 2021 at 02:46:49AM +0000, Neeraj Singh via GitGitGadget wrote:
> > From: Neeraj Singh <neerajsi@microsoft.com>
> [snip]
> > --- a/compat/mingw.h
> > +++ b/compat/mingw.h
> > @@ -329,6 +329,9 @@ int mingw_getpagesize(void);
> >  #define getpagesize mingw_getpagesize
> >  #endif
> >
> > +int win32_fsync_no_flush(int fd);
> > +#define fsync_no_flush win32_fsync_no_flush
> > +
> >  struct rlimit {
> >       unsigned int rlim_cur;
> >  };
> > diff --git a/compat/win32/flush.c b/compat/win32/flush.c
> > new file mode 100644
> > index 00000000000..75324c24ee7
> > --- /dev/null
> > +++ b/compat/win32/flush.c
> > @@ -0,0 +1,28 @@
> > +#include "../../git-compat-util.h"
> > +#include <winternl.h>
> > +#include "lazyload.h"
> > +
> > +int win32_fsync_no_flush(int fd)
> > +{
> > +       IO_STATUS_BLOCK io_status;
> > +
> > +#define FLUSH_FLAGS_FILE_DATA_ONLY 1
> > +
> > +       DECLARE_PROC_ADDR(ntdll.dll, NTSTATUS, NtFlushBuffersFileEx,
> > +                      HANDLE FileHandle, ULONG Flags, PVOID Parameters, ULONG ParameterSize,
> > +                      PIO_STATUS_BLOCK IoStatusBlock);
> > +
> > +       if (!INIT_PROC_ADDR(NtFlushBuffersFileEx)) {
> > +             errno = ENOSYS;
> > +             return -1;
> > +       }
>
> I'm wondering whether it would make sense to fall back to fsync(3P) in
> case we cannot use writeout-only, but I see that were doing essentially
> that in `fsync_or_die()`. There is no indicator to the user though that
> writeout-only doesn't work -- do we want to print a one-time warning?
>

I wanted to leave the fallback to the caller so that the algorithm can
be adjusted in some way based on whether writeout-only succeeded.  For
batched fsync object files, we refrain from doing the last fsync if we
were doing real fsyncs all along.

I didn't want to issue a warning, since this writeout-only codepath
might be invoked from multiple subprocesses, which would each
potentially issue their one warning.  The consequence of failing
writeout only is worse performance, but should not be compromised
safety, so I'm not sure the user gets enough from the warning to
justify something that's potentially spammy.  In practice, when batch
mode is adopted on Windows (by default), some older pre-Win8 systems
will experience fsyncs and equivalent performance to what they're
seeing today. I don't want these users to have a warning too.

> > +       memset(&io_status, 0, sizeof(io_status));
> > +       if (NtFlushBuffersFileEx((HANDLE)_get_osfhandle(fd), FLUSH_FLAGS_FILE_DATA_ONLY,
> > +                             NULL, 0, &io_status)) {
> > +             errno = EINVAL;
> > +             return -1;
> > +       }
> > +
> > +       return 0;
> > +}
>
> [snip]
> > diff --git a/wrapper.c b/wrapper.c
> > index 36e12119d76..1c5f2c87791 100644
> > --- a/wrapper.c
> > +++ b/wrapper.c
> > @@ -546,6 +546,62 @@ int xmkstemp_mode(char *filename_template, int mode)
> >       return fd;
> >  }
> >
> > +int git_fsync(int fd, enum fsync_action action)
> > +{
> > +     switch (action) {
> > +     case FSYNC_WRITEOUT_ONLY:
> > +
> > +#ifdef __APPLE__
> > +             /*
> > +              * on macOS, fsync just causes filesystem cache writeback but does not
> > +              * flush hardware caches.
> > +              */
> > +             return fsync(fd);
>
> Below we're looping around `EINTR` -- are Apple systems never returning
> it?
>

The EINTR check was added due to a test failure on HP NonStop.  I
don't believe any other platform has actually complained about that.

Thanks again for the code review!
-Neeraj

@@ -547,13 +547,33 @@ core.whitespace::
is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):


--rj+/7kVG+u/4iNlT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 07, 2021 at 02:46:50AM +0000, Neeraj Singh via GitGitGadget wro=
te:
> From: Neeraj Singh <neerajsi@microsoft.com>
[snip]
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index c23d01de7dc..c32534c13b4 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1286,7 +1286,7 @@ static void conclude_pack(int fix_thin_pack, const =
char *curr_pack, unsigned cha
>  			    nr_objects - nr_objects_initial);
>  		stop_progress_msg(&progress, msg.buf);
>  		strbuf_release(&msg);
> -		finalize_hashfile(f, tail_hash, 0);
> +		finalize_hashfile(f, tail_hash, FSYNC_COMPONENT_PACK, 0);
>  		hashcpy(read_hash, pack_hash);
>  		fixup_pack_header_footer(output_fd, pack_hash,
>  					 curr_pack, nr_objects,
> @@ -1508,7 +1508,7 @@ static void final(const char *final_pack_name, cons=
t char *curr_pack_name,
>  	if (!from_stdin) {
>  		close(input_fd);
>  	} else {
> -		fsync_or_die(output_fd, curr_pack_name);
> +		fsync_component_or_die(FSYNC_COMPONENT_PACK, output_fd, curr_pack_name=
);
>  		err =3D close(output_fd);
>  		if (err)
>  			die_errno(_("error while closing pack file"));
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 857be7826f3..916c55d6ce9 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
>  		 * If so, rewrite it like in fast-import
>  		 */
>  		if (pack_to_stdout) {
> -			finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> +			finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> +					  CSUM_HASH_IN_STREAM | CSUM_CLOSE);

It doesn't have any effect here given that we don't sync at all when
writing to stdout, but I wonder whether we should set up the component
correctly regardless of that such that it makes for a less confusing
read.

[snip]
> diff --git a/config.c b/config.c
> index c3410b8a868..29c867aab03 100644
> --- a/config.c
> +++ b/config.c
> @@ -1213,6 +1213,73 @@ static int git_parse_maybe_bool_text(const char *v=
alue)
>  	return -1;
>  }
> =20
> +static const struct fsync_component_entry {
> +	const char *name;
> +	enum fsync_component component_bits;
> +} fsync_component_table[] =3D {
> +	{ "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> +	{ "pack", FSYNC_COMPONENT_PACK },
> +	{ "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> +	{ "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> +	{ "objects", FSYNC_COMPONENTS_OBJECTS },
> +	{ "default", FSYNC_COMPONENTS_DEFAULT },
> +	{ "all", FSYNC_COMPONENTS_ALL },
> +};
> +
> +static enum fsync_component parse_fsync_components(const char *var, cons=
t char *string)
> +{
> +	enum fsync_component output =3D 0;
> +
> +	if (!strcmp(string, "none"))
> +		return output;
> +
> +	while (string) {
> +		int i;
> +		size_t len;
> +		const char *ep;
> +		int negated =3D 0;
> +		int found =3D 0;
> +
> +		string =3D string + strspn(string, ", \t\n\r");
> +		ep =3D strchrnul(string, ',');
> +		len =3D ep - string;
> +
> +		if (*string =3D=3D '-') {
> +			negated =3D 1;
> +			string++;
> +			len--;
> +			if (!len)
> +				warning(_("invalid value for variable %s"), var);
> +		}
> +
> +		if (!len)
> +			break;
> +
> +		for (i =3D 0; i < ARRAY_SIZE(fsync_component_table); ++i) {
> +			const struct fsync_component_entry *entry =3D &fsync_component_table[=
i];
> +
> +			if (strncmp(entry->name, string, len))
> +				continue;
> +
> +			found =3D 1;
> +			if (negated)
> +				output &=3D ~entry->component_bits;
> +			else
> +				output |=3D entry->component_bits;
> +		}
> +
> +		if (!found) {
> +			char *component =3D xstrndup(string, len);
> +			warning(_("unknown %s value '%s'"), var, component);
> +			free(component);
> +		}
> +
> +		string =3D ep;
> +	}
> +
> +	return output;
> +}
> +
>  int git_parse_maybe_bool(const char *value)
>  {
>  	int v =3D git_parse_maybe_bool_text(value);
> @@ -1490,6 +1557,13 @@ static int git_default_core_config(const char *var=
, const char *value, void *cb)
>  		return 0;
>  	}
> =20
> +	if (!strcmp(var, "core.fsync")) {
> +		if (!value)
> +			return config_error_nonbool(var);
> +		fsync_components =3D parse_fsync_components(var, value);
> +		return 0;
> +	}
> +
>  	if (!strcmp(var, "core.fsyncmethod")) {
>  		if (!value)
>  			return config_error_nonbool(var);
> @@ -1503,7 +1577,7 @@ static int git_default_core_config(const char *var,=
 const char *value, void *cb)
>  	}
> =20
>  	if (!strcmp(var, "core.fsyncobjectfiles")) {
> -		fsync_object_files =3D git_config_bool(var, value);
> +		warning(_("core.fsyncobjectfiles is deprecated; use core.fsync instead=
"));
>  		return 0;
>  	}

Shouldn't we continue to support this for now such that users can
migrate from the old, deprecated value first before we start to ignore
it?

Patrick

> diff --git a/csum-file.c b/csum-file.c
> index 26e8a6df44e..59ef3398ca2 100644
> --- a/csum-file.c
> +++ b/csum-file.c
> @@ -58,7 +58,8 @@ static void free_hashfile(struct hashfile *f)
>  	free(f);
>  }
> =20
> -int finalize_hashfile(struct hashfile *f, unsigned char *result, unsigne=
d int flags)
> +int finalize_hashfile(struct hashfile *f, unsigned char *result,
> +		      enum fsync_component component, unsigned int flags)
>  {
>  	int fd;
> =20
> @@ -69,7 +70,7 @@ int finalize_hashfile(struct hashfile *f, unsigned char=
 *result, unsigned int fl
>  	if (flags & CSUM_HASH_IN_STREAM)
>  		flush(f, f->buffer, the_hash_algo->rawsz);
>  	if (flags & CSUM_FSYNC)
> -		fsync_or_die(f->fd, f->name);
> +		fsync_component_or_die(component, f->fd, f->name);
>  	if (flags & CSUM_CLOSE) {
>  		if (close(f->fd))
>  			die_errno("%s: sha1 file error on close", f->name);
> diff --git a/csum-file.h b/csum-file.h
> index 291215b34eb..0d29f528fbc 100644
> --- a/csum-file.h
> +++ b/csum-file.h
> @@ -1,6 +1,7 @@
>  #ifndef CSUM_FILE_H
>  #define CSUM_FILE_H
> =20
> +#include "cache.h"
>  #include "hash.h"
> =20
>  struct progress;
> @@ -38,7 +39,7 @@ int hashfile_truncate(struct hashfile *, struct hashfil=
e_checkpoint *);
>  struct hashfile *hashfd(int fd, const char *name);
>  struct hashfile *hashfd_check(const char *name);
>  struct hashfile *hashfd_throughput(int fd, const char *name, struct prog=
ress *tp);
> -int finalize_hashfile(struct hashfile *, unsigned char *, unsigned int);
> +int finalize_hashfile(struct hashfile *, unsigned char *, enum fsync_com=
ponent, unsigned int);
>  void hashwrite(struct hashfile *, const void *, unsigned int);
>  void hashflush(struct hashfile *f);
>  void crc32_begin(struct hashfile *);
> diff --git a/environment.c b/environment.c
> index f9140e842cf..09905adecf9 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -42,6 +42,7 @@ const char *git_hooks_path;
>  int zlib_compression_level =3D Z_BEST_SPEED;
>  int pack_compression_level =3D Z_DEFAULT_COMPRESSION;
>  enum fsync_method fsync_method =3D FSYNC_METHOD_DEFAULT;
> +enum fsync_component fsync_components =3D FSYNC_COMPONENTS_DEFAULT;
>  size_t packed_git_window_size =3D DEFAULT_PACKED_GIT_WINDOW_SIZE;
>  size_t packed_git_limit =3D DEFAULT_PACKED_GIT_LIMIT;
>  size_t delta_base_cache_limit =3D 96 * 1024 * 1024;
> diff --git a/midx.c b/midx.c
> index 837b46b2af5..882f91f7d57 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -1406,7 +1406,8 @@ static int write_midx_internal(const char *object_d=
ir,
>  	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
>  	write_chunkfile(cf, &ctx);
> =20
> -	finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
> +	finalize_hashfile(f, midx_hash, FSYNC_COMPONENT_PACK_METADATA,
> +			  CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>  	free_chunkfile(cf);
> =20
>  	if (flags & (MIDX_WRITE_REV_INDEX | MIDX_WRITE_BITMAP))
> diff --git a/object-file.c b/object-file.c
> index eb972cdccd2..9d9c4a39e85 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1809,8 +1809,7 @@ int hash_object_file(const struct git_hash_algo *al=
go, const void *buf,
>  /* Finalize a file on disk, and close it. */
>  static void close_loose_object(int fd)
>  {
> -	if (fsync_object_files)
> -		fsync_or_die(fd, "loose object file");
> +	fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd, "loose object =
file");
>  	if (close(fd) !=3D 0)
>  		die_errno(_("error when closing loose object file"));
>  }
> diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
> index 9c55c1531e1..c16e43d1669 100644
> --- a/pack-bitmap-write.c
> +++ b/pack-bitmap-write.c
> @@ -719,7 +719,8 @@ void bitmap_writer_finish(struct pack_idx_entry **ind=
ex,
>  	if (options & BITMAP_OPT_HASH_CACHE)
>  		write_hash_cache(f, index, index_nr);
> =20
> -	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOS=
E);
> +	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
> +			  CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE);
> =20
>  	if (adjust_shared_perm(tmp_file.buf))
>  		die_errno("unable to make temporary bitmap file readable");
> diff --git a/pack-write.c b/pack-write.c
> index a5846f3a346..51812cb1299 100644
> --- a/pack-write.c
> +++ b/pack-write.c
> @@ -159,9 +159,9 @@ const char *write_idx_file(const char *index_name, st=
ruct pack_idx_entry **objec
>  	}
> =20
>  	hashwrite(f, sha1, the_hash_algo->rawsz);
> -	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
> -				    ((opts->flags & WRITE_IDX_VERIFY)
> -				    ? 0 : CSUM_FSYNC));
> +	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
> +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE |
> +			  ((opts->flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
>  	return index_name;
>  }
> =20
> @@ -281,8 +281,9 @@ const char *write_rev_file_order(const char *rev_name,
>  	if (rev_name && adjust_shared_perm(rev_name) < 0)
>  		die(_("failed to make %s readable"), rev_name);
> =20
> -	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
> -				    ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
> +	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
> +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE |
> +			  ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
> =20
>  	return rev_name;
>  }
> @@ -390,7 +391,7 @@ void fixup_pack_header_footer(int pack_fd,
>  		the_hash_algo->final_fn(partial_pack_hash, &old_hash_ctx);
>  	the_hash_algo->final_fn(new_pack_hash, &new_hash_ctx);
>  	write_or_die(pack_fd, new_pack_hash, the_hash_algo->rawsz);
> -	fsync_or_die(pack_fd, pack_name);
> +	fsync_component_or_die(FSYNC_COMPONENT_PACK, pack_fd, pack_name);
>  }
> =20
>  char *index_pack_lockfile(int ip_out, int *is_well_formed)
> diff --git a/read-cache.c b/read-cache.c
> index f3986596623..f3539681f49 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -3060,7 +3060,7 @@ static int do_write_index(struct index_state *istat=
e, struct tempfile *tempfile,
>  			return -1;
>  	}
> =20
> -	finalize_hashfile(f, istate->oid.hash, CSUM_HASH_IN_STREAM);
> +	finalize_hashfile(f, istate->oid.hash, FSYNC_COMPONENT_NONE, CSUM_HASH_=
IN_STREAM);
>  	if (close_tempfile_gently(tempfile)) {
>  		error(_("could not close '%s'"), get_tempfile_path(tempfile));
>  		return -1;
> --=20
> gitgitgadget
>=20

--rj+/7kVG+u/4iNlT
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEF9hrgiFbCdvenl/rVbJhu7ckPpQFAmGvSzsACgkQVbJhu7ck
PpQzlg/9H/anSyKJQnHnBpnzSJ3FvmehwLWUbeM96Sk/zcA2q0A+Xb0nKe9vvBcY
LlGV/rxcAUuwZMlJ9pm88Q92EI3fUNUzqJP5lyaW48hUQsA91vSx9tZE2FiaDryp
V4gMzkc6tOhqu739HjSIBYkJ2XW+iabcpITNLodeKl0nFSKJ8rDTK3m69O0NEB4z
pGZs/agFpkI8FPBS/FKioVpmAOe/eM2mcWLGfL4gy2NjtO4FJsFNuP/sKtIxak6Q
0eaRgFcHDknkrvmgtmSCikZtCcuG58CnqcfVy1t1j2E8VwW3bvyrawBD7vWcFRb/
XALqryMaF3+8Rq/yhp7gCtfPqU18xSo/8/t6rjclBL0ldmYaBqXBLA8Ci9TXdDkh
QKlp28wMs4EnhthqXCjm/iUk+0izrDfPWbHPHnGhwiVqHP83fVF4+cNNbrYk6tM5
XaCrR5USoAjQ5IAlpOeuTr5tT2qVHBf5opiX69Ld+riIczFPrBFavV0DP32HMuPM
VOVBcveLBB+LVCR5a6JKMUxJXvypwNN61YOli7fSLhIzzW+voqIlu9q1VbdFZkkS
kscXsSTnKf+tGTEBZi2q5+SfpQFz4rfhAr710W+Zlbp5KZBgDR9v2qOKdnPfMYi4
qx1EfL4k9OmE9SLuvHMfWBiIiYdzJpUWMQ2oVO7hsfOTMKuYDvo=
=xPhw
-----END PGP SIGNATURE-----

--rj+/7kVG+u/4iNlT--

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Tue, Dec 7, 2021 at 3:54 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Tue, Dec 07, 2021 at 02:46:50AM +0000, Neeraj Singh via GitGitGadget wrote:
> > From: Neeraj Singh <neerajsi@microsoft.com>
> [snip]
> > diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> > index c23d01de7dc..c32534c13b4 100644
> > --- a/builtin/index-pack.c
> > +++ b/builtin/index-pack.c
> > @@ -1286,7 +1286,7 @@ static void conclude_pack(int fix_thin_pack, const char *curr_pack, unsigned cha
> >                           nr_objects - nr_objects_initial);
> >               stop_progress_msg(&progress, msg.buf);
> >               strbuf_release(&msg);
> > -             finalize_hashfile(f, tail_hash, 0);
> > +             finalize_hashfile(f, tail_hash, FSYNC_COMPONENT_PACK, 0);
> >               hashcpy(read_hash, pack_hash);
> >               fixup_pack_header_footer(output_fd, pack_hash,
> >                                        curr_pack, nr_objects,
> > @@ -1508,7 +1508,7 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
> >       if (!from_stdin) {
> >               close(input_fd);
> >       } else {
> > -             fsync_or_die(output_fd, curr_pack_name);
> > +             fsync_component_or_die(FSYNC_COMPONENT_PACK, output_fd, curr_pack_name);
> >               err = close(output_fd);
> >               if (err)
> >                       die_errno(_("error while closing pack file"));
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 857be7826f3..916c55d6ce9 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
> >                * If so, rewrite it like in fast-import
> >                */
> >               if (pack_to_stdout) {
> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>
> It doesn't have any effect here given that we don't sync at all when
> writing to stdout, but I wonder whether we should set up the component
> correctly regardless of that such that it makes for a less confusing
> read.
>

If it's not actually a file with some name known to git, is it really
a component of the repository? I'd like to leave this one as-is.

> [snip]
> > diff --git a/config.c b/config.c
> > index c3410b8a868..29c867aab03 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -1213,6 +1213,73 @@ static int git_parse_maybe_bool_text(const char *value)
> >       return -1;
> >  }
> >
> > +static const struct fsync_component_entry {
> > +     const char *name;
> > +     enum fsync_component component_bits;
> > +} fsync_component_table[] = {
> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> > +     { "pack", FSYNC_COMPONENT_PACK },
> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
> > +     { "all", FSYNC_COMPONENTS_ALL },
> > +};
> > +
> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
> > +{
> > +     enum fsync_component output = 0;
> > +
> > +     if (!strcmp(string, "none"))
> > +             return output;
> > +
> > +     while (string) {
> > +             int i;
> > +             size_t len;
> > +             const char *ep;
> > +             int negated = 0;
> > +             int found = 0;
> > +
> > +             string = string + strspn(string, ", \t\n\r");
> > +             ep = strchrnul(string, ',');
> > +             len = ep - string;
> > +
> > +             if (*string == '-') {
> > +                     negated = 1;
> > +                     string++;
> > +                     len--;
> > +                     if (!len)
> > +                             warning(_("invalid value for variable %s"), var);
> > +             }
> > +
> > +             if (!len)
> > +                     break;
> > +
> > +             for (i = 0; i < ARRAY_SIZE(fsync_component_table); ++i) {
> > +                     const struct fsync_component_entry *entry = &fsync_component_table[i];
> > +
> > +                     if (strncmp(entry->name, string, len))
> > +                             continue;
> > +
> > +                     found = 1;
> > +                     if (negated)
> > +                             output &= ~entry->component_bits;
> > +                     else
> > +                             output |= entry->component_bits;
> > +             }
> > +
> > +             if (!found) {
> > +                     char *component = xstrndup(string, len);
> > +                     warning(_("unknown %s value '%s'"), var, component);
> > +                     free(component);
> > +             }
> > +
> > +             string = ep;
> > +     }
> > +
> > +     return output;
> > +}
> > +
> >  int git_parse_maybe_bool(const char *value)
> >  {
> >       int v = git_parse_maybe_bool_text(value);
> > @@ -1490,6 +1557,13 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
> >               return 0;
> >       }
> >
> > +     if (!strcmp(var, "core.fsync")) {
> > +             if (!value)
> > +                     return config_error_nonbool(var);
> > +             fsync_components = parse_fsync_components(var, value);
> > +             return 0;
> > +     }
> > +
> >       if (!strcmp(var, "core.fsyncmethod")) {
> >               if (!value)
> >                       return config_error_nonbool(var);
> > @@ -1503,7 +1577,7 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
> >       }
> >
> >       if (!strcmp(var, "core.fsyncobjectfiles")) {
> > -             fsync_object_files = git_config_bool(var, value);
> > +             warning(_("core.fsyncobjectfiles is deprecated; use core.fsync instead"));
> >               return 0;
> >       }
>
> Shouldn't we continue to support this for now such that users can
> migrate from the old, deprecated value first before we start to ignore
> it?
>
> Patrick
>

That's a good question and one I was hoping to answer through this
discussion.  I'm guessing that most users do not have this setting
explicitly set, and it's largely a non-functional change. Git only
behaves differently in the extreme corner case of a system crash after
git exits.  That's why I believe it's okay to deprecate and remove in
one release.

If we choose to keep supporting the setting, it would introduce a
little complexity in the configuration code, but it should be doable.
I think the right semantics would be to ignore core.fsyncobjectfiles
if core.fsync is specified with any value. Otherwise,
core.fsyncobjectfiles would be equivalent to `default,-loose-object`
or `default,loose-object` for `false` and `true` respectively. I'd
prefer not to do this if I can get away with it :).

Thanks,
Neeraj

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 7, 2021

On the Git mailing list, Patrick Steinhardt wrote (reply to this):


--jJsNv0ui80iiTX3C
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Dec 07, 2021 at 02:46:48AM +0000, Neeraj K. Singh via GitGitGadget =
wrote:
> This is an implementation of an extensible configuration mechanism for
> fsyncing persistent components of a repo.
>=20
> The main goals are to separate the "what" to sync from the "how". There a=
re
> now two settings: core.fsync - Control the 'what', including the index.
> core.fsyncMethod - Control the 'how'. Currently we support writeout-only =
and
> full fsync.
>=20
> Syncing of refs can be layered on top of core.fsync. And batch mode will =
be
> layered on core.fsyncMethod.
>=20
> core.fsyncobjectfiles is removed and will issue a deprecation warning if
> it's seen.
>=20
> I'd like to get agreement on this direction before submitting batch mode =
to
> the list. The batch mode series is available to view at
> https://github.com/neerajsi-msft/git/pull/1.
>=20
> Please see [1], [2], and [3] for discussions that led to this series.
>=20
> V2 changes:
>=20
>  * Updated the documentation for core.fsyncmethod to be less certain.
>    writeout-only probably does not do the right thing on Linux.
>  * Split out the core.fsync=3Dindex change into its own commit.
>  * Rename REPO_COMPONENT to FSYNC_COMPONENT. This is really specific to
>    fsyncing, so the name should reflect that.
>  * Re-add missing Makefile change for SYNC_FILE_RANGE.
>  * Tested writeout-only mode, index syncing, and general config settings.
>=20
> [1] https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.c=
om/
> [2]
> https://lore.kernel.org/git/dd65718814011eb93ccc4428f9882e0f025224a6.1636=
029491.git.ps@pks.im/
> [3]
> https://lore.kernel.org/git/pull.1076.git.git.1629856292.gitgitgadget@gma=
il.com/

While I bail from the question of whether we want to grant as much
configurability to the user as this patch series does, I quite like the
implementation. It feels rather straight-forward and it's easy to see
how to extend it to support syncing of other subsystems like the loose
refs.

Thanks!

Patrick

> Neeraj Singh (3):
>   core.fsyncmethod: add writeout-only mode
>   core.fsync: introduce granular fsync control
>   core.fsync: new option to harden the index
>=20
>  Documentation/config/core.txt       | 35 +++++++++---
>  Makefile                            |  6 ++
>  builtin/fast-import.c               |  2 +-
>  builtin/index-pack.c                |  4 +-
>  builtin/pack-objects.c              |  8 ++-
>  bulk-checkin.c                      |  5 +-
>  cache.h                             | 48 +++++++++++++++-
>  commit-graph.c                      |  3 +-
>  compat/mingw.h                      |  3 +
>  compat/win32/flush.c                | 28 +++++++++
>  config.c                            | 89 ++++++++++++++++++++++++++++-
>  config.mak.uname                    |  3 +
>  configure.ac                        |  8 +++
>  contrib/buildsystems/CMakeLists.txt |  3 +-
>  csum-file.c                         |  5 +-
>  csum-file.h                         |  3 +-
>  environment.c                       |  3 +-
>  git-compat-util.h                   | 24 ++++++++
>  midx.c                              |  3 +-
>  object-file.c                       |  3 +-
>  pack-bitmap-write.c                 |  3 +-
>  pack-write.c                        | 13 +++--
>  read-cache.c                        | 19 ++++--
>  wrapper.c                           | 56 ++++++++++++++++++
>  write-or-die.c                      | 10 ++--
>  25 files changed, 344 insertions(+), 43 deletions(-)
>  create mode 100644 compat/win32/flush.c
>=20
>=20
> base-commit: abe6bb3905392d5eb6b01fa6e54d7e784e0522aa
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1093%2F=
neerajsi-msft%2Fns%2Fcore-fsync-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1093/neera=
jsi-msft/ns/core-fsync-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1093
>=20
> Range-diff vs v1:
>=20
>  1:  527380ddc3f ! 1:  e79522cbdd4 fsync: add writeout-only mode for fsyn=
cing repo data
>      @@ Metadata
>       Author: Neeraj Singh <neerajsi@microsoft.com>
>      =20
>        ## Commit message ##
>      -    fsync: add writeout-only mode for fsyncing repo data
>      +    core.fsyncmethod: add writeout-only mode
>      +
>      +    This commit introduces the `core.fsyncmethod` configuration
>      +    knob, which can currently be set to `fsync` or `writeout-only`.
>      =20
>           The new writeout-only mode attempts to tell the operating syste=
m to
>           flush its in-memory page cache to the storage hardware without =
issuing a
>      @@ Documentation/config/core.txt: core.whitespace::
>       +	using fsync and related primitives.
>       ++
>       +* `fsync` uses the fsync() system call or platform equivalents.
>      -+* `writeout-only` issues requests to send the writes to the storage
>      -+  hardware, but does not send any FLUSH CACHE request. If the oper=
ating system
>      -+  does not support the required interfaces, this falls back to fsy=
nc().
>      ++* `writeout-only` issues pagecache writeback requests, but dependi=
ng on the
>      ++  filesystem and storage hardware, data added to the repository ma=
y not be
>      ++  durable in the event of a system crash. This is the default mode=
 on macOS.
>       +
>        core.fsyncObjectFiles::
>        	This boolean will enable 'fsync()' when writing object files.
>        +
>      =20
>      + ## Makefile ##
>      +@@ Makefile: all::
>      + #
>      + # Define HAVE_CLOCK_MONOTONIC if your platform has CLOCK_MONOTONIC.
>      + #
>      ++# Define HAVE_SYNC_FILE_RANGE if your platform has sync_file_range.
>      ++#
>      + # Define NEEDS_LIBRT if your platform requires linking with librt =
(glibc version
>      + # before 2.17) for clock_gettime and CLOCK_MONOTONIC.
>      + #
>      +@@ Makefile: ifdef HAVE_CLOCK_MONOTONIC
>      + 	BASIC_CFLAGS +=3D -DHAVE_CLOCK_MONOTONIC
>      + endif
>      +=20
>      ++ifdef HAVE_SYNC_FILE_RANGE
>      ++	BASIC_CFLAGS +=3D -DHAVE_SYNC_FILE_RANGE
>      ++endif
>      ++
>      + ifdef NEEDS_LIBRT
>      + 	EXTLIBS +=3D -lrt
>      + endif
>      +
>        ## cache.h ##
>       @@ cache.h: extern int read_replace_refs;
>        extern char *git_replace_ref_base;
>  2:  23311a10142 ! 2:  ff80a94bf9a core.fsync: introduce granular fsync c=
ontrol
>      @@ Commit message
>           knob which can be used to control how components of the
>           repository are made durable on disk.
>      =20
>      -    This setting allows future extensibility of components
>      -    that could be synced in two ways:
>      +    This setting allows future extensibility of the list of
>      +    syncable components:
>           * We issue a warning rather than an error for unrecognized
>             components, so new configs can be used with old Git versions.
>           * We support negation, so users can choose one of the default
>             aggregate options and then remove components that they don't
>      -      want.
>      +      want. The user would then harden any new components added in
>      +      a Git version update.
>      =20
>      -    This also support the common request of doing absolutely no
>      +    This also supports the common request of doing absolutely no
>           fysncing with the `core.fsync=3Dnone` value, which is expected
>           to make the test suite faster.
>      =20
>      -    This commit introduces the new ability for the user to harden
>      -    the index, which is a requirement for being able to actually
>      -    find a file that has been added to the repo and then deleted
>      -    from the working tree.
>      -
>           Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
>      =20
>        ## Documentation/config/core.txt ##
>      @@ Documentation/config/core.txt: core.whitespace::
>       +	hardened via the core.fsyncMethod when created or modified. You c=
an
>       +	disable hardening of any component by prefixing it with a '-'. La=
ter
>       +	items take precedence over earlier ones in the list. For example,
>      -+	`core.fsync=3Dall,-index` means "harden everything except the ind=
ex".
>      -+	Items that are not hardened may be lost in the event of an unclean
>      -+	system shutdown.
>      ++	`core.fsync=3Dall,-pack-metadata` means "harden everything except=
 pack
>      ++	metadata." Items that are not hardened may be lost in the event o=
f an
>      ++	unclean system shutdown.
>       ++
>       +* `none` disables fsync completely. This must be specified alone.
>       +* `loose-object` hardens objects added to the repo in loose-object=
 form.
>       +* `pack` hardens objects added to the repo in packfile form.
>       +* `pack-metadata` hardens packfile bitmaps and indexes.
>       +* `commit-graph` hardens the commit graph file.
>      -+* `index` hardens the index when it is modified.
>       +* `objects` is an aggregate option that includes `loose-objects`, =
`pack`,
>       +  `pack-metadata`, and `commit-graph`.
>       +* `default` is an aggregate option that is equivalent to `objects,=
-loose-object`
>      @@ Documentation/config/core.txt: core.whitespace::
>        	A value indicating the strategy Git will use to harden repository=
 data
>        	using fsync and related primitives.
>       @@ Documentation/config/core.txt: core.fsyncMethod::
>      -   hardware, but does not send any FLUSH CACHE request. If the oper=
ating system
>      -   does not support the required interfaces, this falls back to fsy=
nc().
>      +   filesystem and storage hardware, data added to the repository ma=
y not be
>      +   durable in the event of a system crash. This is the default mode=
 on macOS.
>       =20
>       -core.fsyncObjectFiles::
>       -	This boolean will enable 'fsync()' when writing object files.
>      @@ builtin/fast-import.c: static void end_packfile(void)
>       =20
>        		close_pack_windows(pack_data);
>       -		finalize_hashfile(pack_file, cur_pack_oid.hash, 0);
>      -+		finalize_hashfile(pack_file, cur_pack_oid.hash, REPO_COMPONENT_P=
ACK, 0);
>      ++		finalize_hashfile(pack_file, cur_pack_oid.hash, FSYNC_COMPONENT_=
PACK, 0);
>        		fixup_pack_header_footer(pack_data->pack_fd, pack_data->hash,
>        					 pack_data->pack_name, object_count,
>        					 cur_pack_oid.hash, pack_size);
>      @@ builtin/index-pack.c: static void conclude_pack(int fix_thin_pack=
, const char *c
>        		stop_progress_msg(&progress, msg.buf);
>        		strbuf_release(&msg);
>       -		finalize_hashfile(f, tail_hash, 0);
>      -+		finalize_hashfile(f, tail_hash, REPO_COMPONENT_PACK, 0);
>      ++		finalize_hashfile(f, tail_hash, FSYNC_COMPONENT_PACK, 0);
>        		hashcpy(read_hash, pack_hash);
>        		fixup_pack_header_footer(output_fd, pack_hash,
>        					 curr_pack, nr_objects,
>      @@ builtin/index-pack.c: static void final(const char *final_pack_na=
me, const char
>        		close(input_fd);
>        	} else {
>       -		fsync_or_die(output_fd, curr_pack_name);
>      -+		fsync_component_or_die(REPO_COMPONENT_PACK, output_fd, curr_pack=
_name);
>      ++		fsync_component_or_die(FSYNC_COMPONENT_PACK, output_fd, curr_pac=
k_name);
>        		err =3D close(output_fd);
>        		if (err)
>        			die_errno(_("error while closing pack file"));
>      @@ builtin/pack-objects.c: static void write_pack_file(void)
>        		 */
>        		if (pack_to_stdout) {
>       -			finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>      -+			finalize_hashfile(f, hash, REPO_COMPONENT_NONE,
>      ++			finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
>       +					  CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>        		} else if (nr_written =3D=3D nr_remaining) {
>       -			finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC | C=
SUM_CLOSE);
>      -+			finalize_hashfile(f, hash, REPO_COMPONENT_PACK,
>      ++			finalize_hashfile(f, hash, FSYNC_COMPONENT_PACK,
>       +					  CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE);
>        		} else {
>       -			int fd =3D finalize_hashfile(f, hash, 0);
>      -+			int fd =3D finalize_hashfile(f, hash, REPO_COMPONENT_PACK, 0);
>      ++			int fd =3D finalize_hashfile(f, hash, FSYNC_COMPONENT_PACK, 0);
>        			fixup_pack_header_footer(fd, hash, pack_tmp_name,
>        						 nr_written, hash, offset);
>        			close(fd);
>      @@ bulk-checkin.c: static void finish_bulk_checkin(struct bulk_check=
in_state *state
>        		goto clear_exit;
>        	} else if (state->nr_written =3D=3D 1) {
>       -		finalize_hashfile(state->f, hash, CSUM_HASH_IN_STREAM | CSUM_FSY=
NC | CSUM_CLOSE);
>      -+		finalize_hashfile(state->f, hash, REPO_COMPONENT_PACK,
>      ++		finalize_hashfile(state->f, hash, FSYNC_COMPONENT_PACK,
>       +				  CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE);
>        	} else {
>       -		int fd =3D finalize_hashfile(state->f, hash, 0);
>      -+		int fd =3D finalize_hashfile(state->f, hash, REPO_COMPONENT_PACK=
, 0);
>      ++		int fd =3D finalize_hashfile(state->f, hash, FSYNC_COMPONENT_PAC=
K, 0);
>        		fixup_pack_header_footer(fd, hash, state->pack_tmp_name,
>        					 state->nr_written, hash,
>        					 state->offset);
>      @@ cache.h: void reset_shared_repository(void);
>       -extern int fsync_object_files;
>       +/*
>       + * These values are used to help identify parts of a repository to=
 fsync.
>      -+ * REPO_COMPONENT_NONE identifies data that will not be a persiste=
nt part of the
>      ++ * FSYNC_COMPONENT_NONE identifies data that will not be a persist=
ent part of the
>       + * repository and so shouldn't be fsynced.
>       + */
>      -+enum repo_component {
>      -+	REPO_COMPONENT_NONE			=3D 0,
>      -+	REPO_COMPONENT_LOOSE_OBJECT		=3D 1 << 0,
>      -+	REPO_COMPONENT_PACK			=3D 1 << 1,
>      -+	REPO_COMPONENT_PACK_METADATA		=3D 1 << 2,
>      -+	REPO_COMPONENT_COMMIT_GRAPH		=3D 1 << 3,
>      -+	REPO_COMPONENT_INDEX			=3D 1 << 4,
>      ++enum fsync_component {
>      ++	FSYNC_COMPONENT_NONE			=3D 0,
>      ++	FSYNC_COMPONENT_LOOSE_OBJECT		=3D 1 << 0,
>      ++	FSYNC_COMPONENT_PACK			=3D 1 << 1,
>      ++	FSYNC_COMPONENT_PACK_METADATA		=3D 1 << 2,
>      ++	FSYNC_COMPONENT_COMMIT_GRAPH		=3D 1 << 3,
>       +};
>       +
>      -+#define FSYNC_COMPONENTS_DEFAULT (REPO_COMPONENT_PACK | \
>      -+				  REPO_COMPONENT_PACK_METADATA | \
>      -+				  REPO_COMPONENT_COMMIT_GRAPH)
>      ++#define FSYNC_COMPONENTS_DEFAULT (FSYNC_COMPONENT_PACK | \
>      ++				  FSYNC_COMPONENT_PACK_METADATA | \
>      ++				  FSYNC_COMPONENT_COMMIT_GRAPH)
>       +
>      -+#define FSYNC_COMPONENTS_OBJECTS (REPO_COMPONENT_LOOSE_OBJECT | \
>      -+				  REPO_COMPONENT_PACK | \
>      -+				  REPO_COMPONENT_PACK_METADATA | \
>      -+				  REPO_COMPONENT_COMMIT_GRAPH)
>      ++#define FSYNC_COMPONENTS_OBJECTS (FSYNC_COMPONENT_LOOSE_OBJECT | \
>      ++				  FSYNC_COMPONENT_PACK | \
>      ++				  FSYNC_COMPONENT_PACK_METADATA | \
>      ++				  FSYNC_COMPONENT_COMMIT_GRAPH)
>       +
>      -+#define FSYNC_COMPONENTS_ALL (REPO_COMPONENT_LOOSE_OBJECT | \
>      -+			      REPO_COMPONENT_PACK | \
>      -+			      REPO_COMPONENT_PACK_METADATA | \
>      -+			      REPO_COMPONENT_COMMIT_GRAPH | \
>      -+			      REPO_COMPONENT_INDEX)
>      ++#define FSYNC_COMPONENTS_ALL (FSYNC_COMPONENT_LOOSE_OBJECT | \
>      ++			      FSYNC_COMPONENT_PACK | \
>      ++			      FSYNC_COMPONENT_PACK_METADATA | \
>      ++			      FSYNC_COMPONENT_COMMIT_GRAPH)
>       +
>       +
>       +/*
>       + * A bitmask indicating which components of the repo should be fsy=
nced.
>       + */
>      -+extern enum repo_component fsync_components;
>      ++extern enum fsync_component fsync_components;
>       =20
>        enum fsync_method {
>        	FSYNC_METHOD_FSYNC,
>      @@ cache.h: int copy_file_with_time(const char *dst, const char *src=
, int mode);
>        void write_or_die(int fd, const void *buf, size_t count);
>        void fsync_or_die(int fd, const char *);
>       =20
>      -+inline void fsync_component_or_die(enum repo_component component, =
int fd, const char *msg)
>      ++inline void fsync_component_or_die(enum fsync_component component,=
 int fd, const char *msg)
>       +{
>       +	if (fsync_components & component)
>       +		fsync_or_die(fd, msg);
>      @@ commit-graph.c: static int write_commit_graph_file(struct write_c=
ommit_graph_con
>       =20
>        	close_commit_graph(ctx->r->objects);
>       -	finalize_hashfile(f, file_hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
>      -+	finalize_hashfile(f, file_hash, REPO_COMPONENT_COMMIT_GRAPH,
>      ++	finalize_hashfile(f, file_hash, FSYNC_COMPONENT_COMMIT_GRAPH,
>       +			  CSUM_HASH_IN_STREAM | CSUM_FSYNC);
>        	free_chunkfile(cf);
>       =20
>      @@ config.c: static int git_parse_maybe_bool_text(const char *value)
>       =20
>       +static const struct fsync_component_entry {
>       +	const char *name;
>      -+	enum repo_component component_bits;
>      ++	enum fsync_component component_bits;
>       +} fsync_component_table[] =3D {
>      -+	{ "loose-object", REPO_COMPONENT_LOOSE_OBJECT },
>      -+	{ "pack", REPO_COMPONENT_PACK },
>      -+	{ "pack-metadata", REPO_COMPONENT_PACK_METADATA },
>      -+	{ "commit-graph", REPO_COMPONENT_COMMIT_GRAPH },
>      -+	{ "index", REPO_COMPONENT_INDEX },
>      ++	{ "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
>      ++	{ "pack", FSYNC_COMPONENT_PACK },
>      ++	{ "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
>      ++	{ "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
>       +	{ "objects", FSYNC_COMPONENTS_OBJECTS },
>       +	{ "default", FSYNC_COMPONENTS_DEFAULT },
>       +	{ "all", FSYNC_COMPONENTS_ALL },
>       +};
>       +
>      -+static enum repo_component parse_fsync_components(const char *var,=
 const char *string)
>      ++static enum fsync_component parse_fsync_components(const char *var=
, const char *string)
>       +{
>      -+	enum repo_component output =3D 0;
>      ++	enum fsync_component output =3D 0;
>       +
>       +	if (!strcmp(string, "none"))
>       +		return output;
>      @@ csum-file.c: static void free_hashfile(struct hashfile *f)
>       =20
>       -int finalize_hashfile(struct hashfile *f, unsigned char *result, u=
nsigned int flags)
>       +int finalize_hashfile(struct hashfile *f, unsigned char *result,
>      -+		      enum repo_component component, unsigned int flags)
>      ++		      enum fsync_component component, unsigned int flags)
>        {
>        	int fd;
>       =20
>      @@ csum-file.c: int finalize_hashfile(struct hashfile *f, unsigned c=
har *result, un
>        			die_errno("%s: sha1 file error on close", f->name);
>      =20
>        ## csum-file.h ##
>      +@@
>      + #ifndef CSUM_FILE_H
>      + #define CSUM_FILE_H
>      +=20
>      ++#include "cache.h"
>      + #include "hash.h"
>      +=20
>      + struct progress;
>       @@ csum-file.h: int hashfile_truncate(struct hashfile *, struct has=
hfile_checkpoint *);
>        struct hashfile *hashfd(int fd, const char *name);
>        struct hashfile *hashfd_check(const char *name);
>        struct hashfile *hashfd_throughput(int fd, const char *name, struc=
t progress *tp);
>       -int finalize_hashfile(struct hashfile *, unsigned char *, unsigned=
 int);
>      -+int finalize_hashfile(struct hashfile *, unsigned char *, enum rep=
o_component, unsigned int);
>      ++int finalize_hashfile(struct hashfile *, unsigned char *, enum fsy=
nc_component, unsigned int);
>        void hashwrite(struct hashfile *, const void *, unsigned int);
>        void hashflush(struct hashfile *f);
>        void crc32_begin(struct hashfile *);
>      @@ environment.c: const char *git_hooks_path;
>        int zlib_compression_level =3D Z_BEST_SPEED;
>        int pack_compression_level =3D Z_DEFAULT_COMPRESSION;
>        enum fsync_method fsync_method =3D FSYNC_METHOD_DEFAULT;
>      -+enum repo_component fsync_components =3D FSYNC_COMPONENTS_DEFAULT;
>      ++enum fsync_component fsync_components =3D FSYNC_COMPONENTS_DEFAULT;
>        size_t packed_git_window_size =3D DEFAULT_PACKED_GIT_WINDOW_SIZE;
>        size_t packed_git_limit =3D DEFAULT_PACKED_GIT_LIMIT;
>        size_t delta_base_cache_limit =3D 96 * 1024 * 1024;
>      @@ midx.c: static int write_midx_internal(const char *object_dir,
>        	write_chunkfile(cf, &ctx);
>       =20
>       -	finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>      -+	finalize_hashfile(f, midx_hash, REPO_COMPONENT_PACK_METADATA,
>      ++	finalize_hashfile(f, midx_hash, FSYNC_COMPONENT_PACK_METADATA,
>       +			  CSUM_FSYNC | CSUM_HASH_IN_STREAM);
>        	free_chunkfile(cf);
>       =20
>      @@ object-file.c: int hash_object_file(const struct git_hash_algo *a=
lgo, const void
>        {
>       -	if (fsync_object_files)
>       -		fsync_or_die(fd, "loose object file");
>      -+	fsync_component_or_die(REPO_COMPONENT_LOOSE_OBJECT, fd, "loose ob=
ject file");
>      ++	fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd, "loose o=
bject file");
>        	if (close(fd) !=3D 0)
>        		die_errno(_("error when closing loose object file"));
>        }
>      @@ pack-bitmap-write.c: void bitmap_writer_finish(struct pack_idx_en=
try **index,
>        		write_hash_cache(f, index, index_nr);
>       =20
>       -	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSU=
M_CLOSE);
>      -+	finalize_hashfile(f, NULL, REPO_COMPONENT_PACK_METADATA,
>      ++	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
>       +			  CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE);
>       =20
>        	if (adjust_shared_perm(tmp_file.buf))
>      @@ pack-write.c: const char *write_idx_file(const char *index_name, =
struct pack_idx
>        	hashwrite(f, sha1, the_hash_algo->rawsz);
>       -	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
>       -				    ((opts->flags & WRITE_IDX_VERIFY)
>      -+	finalize_hashfile(f, NULL, REPO_COMPONENT_PACK_METADATA,
>      +-				    ? 0 : CSUM_FSYNC));
>      ++	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
>       +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE |
>      -+			  ((opts->flags & WRITE_IDX_VERIFY)
>      - 				    ? 0 : CSUM_FSYNC));
>      ++			  ((opts->flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
>        	return index_name;
>        }
>      +=20
>       @@ pack-write.c: const char *write_rev_file_order(const char *rev_n=
ame,
>        	if (rev_name && adjust_shared_perm(rev_name) < 0)
>        		die(_("failed to make %s readable"), rev_name);
>       =20
>       -	finalize_hashfile(f, NULL, CSUM_HASH_IN_STREAM | CSUM_CLOSE |
>       -				    ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
>      -+	finalize_hashfile(f, NULL, REPO_COMPONENT_PACK_METADATA,
>      ++	finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK_METADATA,
>       +			  CSUM_HASH_IN_STREAM | CSUM_CLOSE |
>       +			  ((flags & WRITE_IDX_VERIFY) ? 0 : CSUM_FSYNC));
>       =20
>      @@ pack-write.c: void fixup_pack_header_footer(int pack_fd,
>        	the_hash_algo->final_fn(new_pack_hash, &new_hash_ctx);
>        	write_or_die(pack_fd, new_pack_hash, the_hash_algo->rawsz);
>       -	fsync_or_die(pack_fd, pack_name);
>      -+	fsync_component_or_die(REPO_COMPONENT_PACK, pack_fd, pack_name);
>      ++	fsync_component_or_die(FSYNC_COMPONENT_PACK, pack_fd, pack_name);
>        }
>       =20
>        char *index_pack_lockfile(int ip_out, int *is_well_formed)
>      =20
>        ## read-cache.c ##
>      -@@ read-cache.c: static int record_ieot(void)
>      -  * rely on it.
>      -  */
>      - static int do_write_index(struct index_state *istate, struct tempf=
ile *tempfile,
>      --			  int strip_extensions)
>      -+			  int strip_extensions, unsigned flags)
>      - {
>      - 	uint64_t start =3D getnanotime();
>      - 	struct hashfile *f;
>      -@@ read-cache.c: static int do_write_index(struct index_state *ista=
te, struct tempfile *tempfile,
>      - 	struct strbuf previous_name_buf =3D STRBUF_INIT, *previous_name;
>      - 	int drop_cache_tree =3D istate->drop_cache_tree;
>      - 	off_t offset;
>      -+	int csum_fsync_flag;
>      - 	int ieot_entries =3D 1;
>      - 	struct index_entry_offset_table *ieot =3D NULL;
>      - 	int nr, nr_threads;
>       @@ read-cache.c: static int do_write_index(struct index_state *ista=
te, struct tempfile *tempfile,
>        			return -1;
>        	}
>       =20
>       -	finalize_hashfile(f, istate->oid.hash, CSUM_HASH_IN_STREAM);
>      -+	csum_fsync_flag =3D 0;
>      -+	if (!alternate_index_output && (flags & COMMIT_LOCK))
>      -+		csum_fsync_flag =3D CSUM_FSYNC;
>      -+
>      -+	finalize_hashfile(f, istate->oid.hash, REPO_COMPONENT_INDEX,
>      -+			  CSUM_HASH_IN_STREAM | csum_fsync_flag);
>      -+
>      ++	finalize_hashfile(f, istate->oid.hash, FSYNC_COMPONENT_NONE, CSUM=
_HASH_IN_STREAM);
>        	if (close_tempfile_gently(tempfile)) {
>        		error(_("could not close '%s'"), get_tempfile_path(tempfile));
>        		return -1;
>      -@@ read-cache.c: static int do_write_locked_index(struct index_stat=
e *istate, struct lock_file *l
>      - 	 */
>      - 	trace2_region_enter_printf("index", "do_write_index", the_reposit=
ory,
>      - 				   "%s", get_lock_file_path(lock));
>      --	ret =3D do_write_index(istate, lock->tempfile, 0);
>      -+	ret =3D do_write_index(istate, lock->tempfile, 0, flags);
>      - 	trace2_region_leave_printf("index", "do_write_index", the_reposit=
ory,
>      - 				   "%s", get_lock_file_path(lock));
>      -=20
>      -@@ read-cache.c: static int clean_shared_index_files(const char *cu=
rrent_hex)
>      - }
>      -=20
>      - static int write_shared_index(struct index_state *istate,
>      --			      struct tempfile **temp)
>      -+			      struct tempfile **temp, unsigned flags)
>      - {
>      - 	struct split_index *si =3D istate->split_index;
>      - 	int ret, was_full =3D !istate->sparse_index;
>      -@@ read-cache.c: static int write_shared_index(struct index_state *=
istate,
>      -=20
>      - 	trace2_region_enter_printf("index", "shared/do_write_index",
>      - 				   the_repository, "%s", get_tempfile_path(*temp));
>      --	ret =3D do_write_index(si->base, *temp, 1);
>      -+	ret =3D do_write_index(si->base, *temp, 1, flags);
>      - 	trace2_region_leave_printf("index", "shared/do_write_index",
>      - 				   the_repository, "%s", get_tempfile_path(*temp));
>      -=20
>      -@@ read-cache.c: int write_locked_index(struct index_state *istate,=
 struct lock_file *lock,
>      - 			ret =3D do_write_locked_index(istate, lock, flags);
>      - 			goto out;
>      - 		}
>      --		ret =3D write_shared_index(istate, &temp);
>      -+		ret =3D write_shared_index(istate, &temp, flags);
>      -=20
>      - 		saved_errno =3D errno;
>      - 		if (is_tempfile_active(temp))
>  -:  ----------- > 3:  86e39b8f8d1 core.fsync: new option to harden the i=
ndex
>=20
> --=20
> gitgitgadget

--jJsNv0ui80iiTX3C
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEF9hrgiFbCdvenl/rVbJhu7ckPpQFAmGvS+MACgkQVbJhu7ck
PpQv0Q/9GZfj5q4KMTcGuVtsXC0Qkj+JfQ822pNDq2C41jtxJp9acx2D8bqREAkl
QG+YzgjXyq1pik0hezQZr7OLoYV65GUye2as0q4WJlMpd1SKYRZh2uukqFNzILql
AintZfTSUNFRz00xHkRBGiwtxFeCNn2saegJemzQdu/pVDA5cuIfAS17fGgRrCIk
ZhFkIObr8ZYqUHzI6PKG6I7SlzwDx9iGcjDx+4DIlRhN3skBIRvhIKN9TA/dfvdz
CbQ0HE7ID64LIIxX+1e50D9nWuO6TmSHeOFjyUmnZLTmzbVp+edq/WuCNmWvU0BR
yXGj8Ia+y72fqfTQBJHsZ/8gvbyAtV6sAe9NuGiOM2JbSm82HowQGtmMml/RELVT
eRCVTyttCS4wkYuZb8wUgD5C1inYHDVIUcY90n69T7Uy3XPZRuOJ/Qm+qddKGTE0
r5+G8+guCI8b4gMdsS50X1sLgkbJHc+sBd4zcV4LSzwyQX/+TxiA0c5Gl6d9AdIv
f998RMpfljJ5hdNTUtX1YKsNmO9HSM8X2EHRp3z5hi7VxdK9EH/2lNekjaDsqYNI
JgcYmuWH7TS0bKXu9dZA78gf2t4PKoe2xBGAeP/CRHKx9VyRwWJpWBdudh04NRI6
QkFnwu0jTh4cD0FzE4oIj5hpvKFPuEtdHI2TWak49McEz+U2GZk=
=U8Y2
-----END PGP SIGNATURE-----

--jJsNv0ui80iiTX3C--

@@ -547,6 +547,15 @@ core.whitespace::
is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> This commit introduces the `core.fsyncmethod` configuration

Just a commit msg nit: core.fsyncMethod (I see the docs etc. are using
it camelCased, good..

> diff --git a/compat/win32/flush.c b/compat/win32/flush.c
> new file mode 100644
> index 00000000000..75324c24ee7
> --- /dev/null
> +++ b/compat/win32/flush.c
> @@ -0,0 +1,28 @@
> +#include "../../git-compat-util.h"

nit: Just FWIW I think the better thing is '#include
"git-compat-util.h"', i.e. we're compiling at the top-level and have
added it to -I.

(I know a lot of compat/ and contrib/ and even main-tree stuff does
that, but just FWIW it's not needed).

> +	if (!strcmp(var, "core.fsyncmethod")) {
> +		if (!value)
> +			return config_error_nonbool(var);
> +		if (!strcmp(value, "fsync"))
> +			fsync_method = FSYNC_METHOD_FSYNC;
> +		else if (!strcmp(value, "writeout-only"))
> +			fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> +		else

As a non-nit comment I think this config schema looks great so far.

> +			warning(_("unknown %s value '%s'"), var, value);

Just a suggestion maybe something slightly scarier like:

    "unknown core.fsyncMethod value '%s'; config from future git version? ignoring requested fsync strategy"

Also using the nicer camelCased version instead of "var" (also helps
translators with context...)

> +int git_fsync(int fd, enum fsync_action action)
> +{
> +	switch (action) {
> +	case FSYNC_WRITEOUT_ONLY:
> +
> +#ifdef __APPLE__
> +		/*
> +		 * on macOS, fsync just causes filesystem cache writeback but does not
> +		 * flush hardware caches.
> +		 */
> +		return fsync(fd);
> +#endif
> +
> +#ifdef HAVE_SYNC_FILE_RANGE
> +		/*
> +		 * On linux 2.6.17 and above, sync_file_range is the way to issue
> +		 * a writeback without a hardware flush. An offset of 0 and size of 0
> +		 * indicates writeout of the entire file and the wait flags ensure that all
> +		 * dirty data is written to the disk (potentially in a disk-side cache)
> +		 * before we continue.
> +		 */
> +
> +		return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> +						 SYNC_FILE_RANGE_WRITE |
> +						 SYNC_FILE_RANGE_WAIT_AFTER);
> +#endif
> +
> +#ifdef fsync_no_flush
> +		return fsync_no_flush(fd);
> +#endif
> +
> +		errno = ENOSYS;
> +		return -1;
> +
> +	case FSYNC_HARDWARE_FLUSH:
> +		/*
> +		 * On some platforms fsync may return EINTR. Try again in this
> +		 * case, since callers asking for a hardware flush may die if
> +		 * this function returns an error.
> +		 */
> +		for (;;) {
> +			int err;
> +#ifdef __APPLE__
> +			err = fcntl(fd, F_FULLFSYNC);
> +#else
> +			err = fsync(fd);
> +#endif
> +			if (err >= 0 || errno != EINTR)
> +				return err;
> +		}
> +
> +	default:
> +		BUG("unexpected git_fsync(%d) call", action);

Don't include such "default" cases, you have an exhaustive "enum", if
you skip it the compiler will check this for you.

> +	}
> +}
> +
>  static int warn_if_unremovable(const char *op, const char *file, int rc)

Just a code nit: I think it's very much preferred if possible to have as
much of code like this compile on all platforms. See the series at
4002e87cb25 (grep: remove #ifdef NO_PTHREADS, 2018-11-03) is part of for
a good example.

Maybe not worth it in this case since they're not nested ifdef's.

I'm basically thinking of something (also re Patrick's comment on the
2nd patch) where we have a platform_fsync() whose return
value/arguments/whatever capture this "I want to return now" or "you
should be looping" and takes the enum_fsync_action" strategy.

Then the git_fsync() would be the platform-independent looping etc., and
another funciton would do the "one fsync at a time, maybe call me
again".

Maybe it would suck more, just food for thought... :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Tue, Dec 7, 2021 at 4:27 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > This commit introduces the `core.fsyncmethod` configuration
>
> Just a commit msg nit: core.fsyncMethod (I see the docs etc. are using
> it camelCased, good..

Will fix.

> > diff --git a/compat/win32/flush.c b/compat/win32/flush.c
> > new file mode 100644
> > index 00000000000..75324c24ee7
> > --- /dev/null
> > +++ b/compat/win32/flush.c
> > @@ -0,0 +1,28 @@
> > +#include "../../git-compat-util.h"
>
> nit: Just FWIW I think the better thing is '#include
> "git-compat-util.h"', i.e. we're compiling at the top-level and have
> added it to -I.
>
> (I know a lot of compat/ and contrib/ and even main-tree stuff does
> that, but just FWIW it's not needed).
>

Will fix.

> > +     if (!strcmp(var, "core.fsyncmethod")) {
> > +             if (!value)
> > +                     return config_error_nonbool(var);
> > +             if (!strcmp(value, "fsync"))
> > +                     fsync_method = FSYNC_METHOD_FSYNC;
> > +             else if (!strcmp(value, "writeout-only"))
> > +                     fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> > +             else
>
> As a non-nit comment I think this config schema looks great so far.
>
> > +                     warning(_("unknown %s value '%s'"), var, value);
>
> Just a suggestion maybe something slightly scarier like:
>
>     "unknown core.fsyncMethod value '%s'; config from future git version? ignoring requested fsync strategy"
>
> Also using the nicer camelCased version instead of "var" (also helps
> translators with context...)
>

Will fix.  The motivation for this scheme was to 'factor' the messages
so there would be less to translate. But I see now that the message
doesn't have enough context to translate reasonably.

> > +int git_fsync(int fd, enum fsync_action action)
> > +{
> > +     switch (action) {
> > +     case FSYNC_WRITEOUT_ONLY:
> > +
> > +#ifdef __APPLE__
> > +             /*
> > +              * on macOS, fsync just causes filesystem cache writeback but does not
> > +              * flush hardware caches.
> > +              */
> > +             return fsync(fd);
> > +#endif
> > +
> > +#ifdef HAVE_SYNC_FILE_RANGE
> > +             /*
> > +              * On linux 2.6.17 and above, sync_file_range is the way to issue
> > +              * a writeback without a hardware flush. An offset of 0 and size of 0
> > +              * indicates writeout of the entire file and the wait flags ensure that all
> > +              * dirty data is written to the disk (potentially in a disk-side cache)
> > +              * before we continue.
> > +              */
> > +
> > +             return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> > +                                              SYNC_FILE_RANGE_WRITE |
> > +                                              SYNC_FILE_RANGE_WAIT_AFTER);
> > +#endif
> > +
> > +#ifdef fsync_no_flush
> > +             return fsync_no_flush(fd);
> > +#endif
> > +
> > +             errno = ENOSYS;
> > +             return -1;
> > +
> > +     case FSYNC_HARDWARE_FLUSH:
> > +             /*
> > +              * On some platforms fsync may return EINTR. Try again in this
> > +              * case, since callers asking for a hardware flush may die if
> > +              * this function returns an error.
> > +              */
> > +             for (;;) {
> > +                     int err;
> > +#ifdef __APPLE__
> > +                     err = fcntl(fd, F_FULLFSYNC);
> > +#else
> > +                     err = fsync(fd);
> > +#endif
> > +                     if (err >= 0 || errno != EINTR)
> > +                             return err;
> > +             }
> > +
> > +     default:
> > +             BUG("unexpected git_fsync(%d) call", action);
>
> Don't include such "default" cases, you have an exhaustive "enum", if
> you skip it the compiler will check this for you.
>

Junio gave the feedback to include this "default:" case in the switch
[1].  Removing the default leads to the "error: control reaches end of
non-void function" on gcc. Fixing that error and adding a trial option
does give the exhaustiveness error that you're talking about.  I'd
rather just leave this as-is since the BUG() obviates the need for an
in-practice-unreachable return statement.

[1] https://lore.kernel.org/git/xmqqfsu70x58.fsf@gitster.g/

> > +     }
> > +}
> > +
> >  static int warn_if_unremovable(const char *op, const char *file, int rc)
>
> Just a code nit: I think it's very much preferred if possible to have as
> much of code like this compile on all platforms. See the series at
> 4002e87cb25 (grep: remove #ifdef NO_PTHREADS, 2018-11-03) is part of for
> a good example.
>
> Maybe not worth it in this case since they're not nested ifdef's.
>
> I'm basically thinking of something (also re Patrick's comment on the
> 2nd patch) where we have a platform_fsync() whose return
> value/arguments/whatever capture this "I want to return now" or "you
> should be looping" and takes the enum_fsync_action" strategy.
>
> Then the git_fsync() would be the platform-independent looping etc., and
> another funciton would do the "one fsync at a time, maybe call me
> again".
>
> Maybe it would suck more, just food for thought... :)

I'm going to introduce a new static function called fsync_loop which
does the looping and I'll call it from git_fsync.  That appears to be
the cleanest to me to address your and Patrick's feedback.

Thanks,
Neeraj

@@ -547,13 +547,33 @@ core.whitespace::
is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> This commit introduces the `core.fsync` configuration
> knob which can be used to control how components of the
> repository are made durable on disk.
>
> This setting allows future extensibility of the list of
> syncable components:
> * We issue a warning rather than an error for unrecognized
>   components, so new configs can be used with old Git versions.

Looks good!

> * We support negation, so users can choose one of the default
>   aggregate options and then remove components that they don't
>   want. The user would then harden any new components added in
>   a Git version update.

I think this config schema makes sense, but just a (I think important)
comment on the "how" not "what" of it. It's really much better to define
config as:

    [some]
    key = value
    key = value2

Than:

    [some]
    key = value,value2

The reason is that "git config" has good support for working with
multi-valued stuff, so you can do e.g.:

    git config --get-all -z some.key

And you can easily (re)set such config e.g. with --replace-all etc., but
for comma-delimited you (and users) need to do all that work themselves.

Similarly instead of:

    some.key = want-this
    some.key = -not-this
    some.key = but-want-this

I think it's better to just have two lists, one inclusive another
exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
"transfer.hideRefs"

Which would mean:

    core.fsync = want-this
    core.fsyncExcludes = -not-this

For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
suggestion on making this easier for users & the implementation.

> This also supports the common request of doing absolutely no
> fysncing with the `core.fsync=none` value, which is expected
> to make the test suite faster.

Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
so we'll accept "false", "off", "no" like most other such config?

> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  Documentation/config/core.txt | 27 +++++++++----
>  builtin/fast-import.c         |  2 +-
>  builtin/index-pack.c          |  4 +-
>  builtin/pack-objects.c        |  8 ++--
>  bulk-checkin.c                |  5 ++-
>  cache.h                       | 39 +++++++++++++++++-
>  commit-graph.c                |  3 +-
>  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
>  csum-file.c                   |  5 ++-
>  csum-file.h                   |  3 +-
>  environment.c                 |  1 +
>  midx.c                        |  3 +-
>  object-file.c                 |  3 +-
>  pack-bitmap-write.c           |  3 +-
>  pack-write.c                  | 13 +++---
>  read-cache.c                  |  2 +-
>  16 files changed, 164 insertions(+), 33 deletions(-)
>
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index dbb134f7136..4f1747ec871 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -547,6 +547,25 @@ core.whitespace::
>    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
>    errors. The default tab width is 8. Allowed values are 1 to 63.
>  
> +core.fsync::
> +	A comma-separated list of parts of the repository which should be
> +	hardened via the core.fsyncMethod when created or modified. You can
> +	disable hardening of any component by prefixing it with a '-'. Later
> +	items take precedence over earlier ones in the list. For example,
> +	`core.fsync=all,-pack-metadata` means "harden everything except pack
> +	metadata." Items that are not hardened may be lost in the event of an
> +	unclean system shutdown.
> ++
> +* `none` disables fsync completely. This must be specified alone.
> +* `loose-object` hardens objects added to the repo in loose-object form.
> +* `pack` hardens objects added to the repo in packfile form.
> +* `pack-metadata` hardens packfile bitmaps and indexes.
> +* `commit-graph` hardens the commit graph file.
> +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
> +  `pack-metadata`, and `commit-graph`.
> +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
> +* `all` is an aggregate option that syncs all individual components above.
> +

It's probably a *bit* more work to set up, but I wonder if this wouldn't
be simpler if we just said (and this is partially going against what I
noted above):

== BEGIN DOC

core.fsync is a multi-value config variable where each item is a
pathspec that'll get matched the same way as 'git-ls-files' et al.

When we sync pretend that a path like .git/objects/de/adbeef... is
relative to the top-level of the git
directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".

You can then supply a list of wildcards and exclusions to configure
syncing.  or "false", "off" etc. to turn it off. These are synonymous
with:

    ; same as "false"
    core.fsync = ":!*"

Or:

    ; same as "true"
    core.fsync = "*"

Or, to selectively sync some things and not others:

    ;; Sync objects, but not "info"
    core.fsync = ":!objects/info/**"
    core.fsync = "objects/**"

See gitrepository-layout(5) for details about what sort of paths you
might be expected to match. Not all paths listed there will go through
this mechanism (e.g. currently objects do, but nothing to do with config
does).

We can and will match this against "fake paths", e.g. when writing out
packs we may match against just the string "objects/pack", we're not
going to re-check if every packfile we're writing matches your globs,
ditto for loose objects. Be reasonable!

This metharism is intended as a shorthand that provides some flexibility
when fsyncing, while not forcing git to come up with labels for all
paths the git dir, or to support crazyness like "objects/de/adbeef*"

More paths may be added or removed in the future, and we make no
promises that we won't move things around, so if in doubt use
e.g. "true" or a wide pattern match like "objects/**". When in doubt
stick to the golden path of examples provided in this documentation.

== END DOC


It's a tad more complex to set up, but I wonder if that isn't worth
it. It nicely gets around any current and future issues of deciding what
labels such as "loose-object" etc. to pick, as well as slotting into an
existing method of doing exclude/include lists.

> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 857be7826f3..916c55d6ce9 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
>  		 * If so, rewrite it like in fast-import
>  		 */
>  		if (pack_to_stdout) {
> -			finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> +			finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> +					  CSUM_HASH_IN_STREAM | CSUM_CLOSE);

Not really related to this per-se, but since you're touching the API
everything goes through I wonder if callers should just always try to
fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
tries to flush stdout, or catch the fd at that lower level.

Or maybe there's a good reason for this...

> [...]
> +/*
> + * These values are used to help identify parts of a repository to fsync.
> + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
> + * repository and so shouldn't be fsynced.
> + */
> +enum fsync_component {
> +	FSYNC_COMPONENT_NONE			= 0,

I haven't read ahead much but in most other such cases we don't define
the "= 0", just start at 1<<0, then check the flags elsewhere...

> +static const struct fsync_component_entry {
> +	const char *name;
> +	enum fsync_component component_bits;
> +} fsync_component_table[] = {
> +	{ "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> +	{ "pack", FSYNC_COMPONENT_PACK },
> +	{ "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> +	{ "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> +	{ "objects", FSYNC_COMPONENTS_OBJECTS },
> +	{ "default", FSYNC_COMPONENTS_DEFAULT },
> +	{ "all", FSYNC_COMPONENTS_ALL },
> +};
> +
> +static enum fsync_component parse_fsync_components(const char *var, const char *string)
> +{
> +	enum fsync_component output = 0;
> +
> +	if (!strcmp(string, "none"))
> +		return output;
> +
> +	while (string) {
> +		int i;
> +		size_t len;
> +		const char *ep;
> +		int negated = 0;
> +		int found = 0;
> +
> +		string = string + strspn(string, ", \t\n\r");

Aside from the "use a list" isn't this hardcoding some windows-specific
assumptions with \n\r? Maybe not...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > This commit introduces the `core.fsync` configuration
> > knob which can be used to control how components of the
> > repository are made durable on disk.
> >
> > This setting allows future extensibility of the list of
> > syncable components:
> > * We issue a warning rather than an error for unrecognized
> >   components, so new configs can be used with old Git versions.
>
> Looks good!
>
> > * We support negation, so users can choose one of the default
> >   aggregate options and then remove components that they don't
> >   want. The user would then harden any new components added in
> >   a Git version update.
>
> I think this config schema makes sense, but just a (I think important)
> comment on the "how" not "what" of it. It's really much better to define
> config as:
>
>     [some]
>     key = value
>     key = value2
>
> Than:
>
>     [some]
>     key = value,value2
>
> The reason is that "git config" has good support for working with
> multi-valued stuff, so you can do e.g.:
>
>     git config --get-all -z some.key
>
> And you can easily (re)set such config e.g. with --replace-all etc., but
> for comma-delimited you (and users) need to do all that work themselves.
>
> Similarly instead of:
>
>     some.key = want-this
>     some.key = -not-this
>     some.key = but-want-this
>
> I think it's better to just have two lists, one inclusive another
> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
> "transfer.hideRefs"
>
> Which would mean:
>
>     core.fsync = want-this
>     core.fsyncExcludes = -not-this
>
> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
> suggestion on making this easier for users & the implementation.
>

Maybe there's some way to handle this I'm unaware of, but a
disadvantage of your multi-valued config proposal is that it's harder,
for example, for a per-repo config store to reasonably override a
per-user config store.  With the configuration scheme as-is, I can
have a per-user setting like `core.fsync=all` which covers my typical
repos, but then have a maintainer repo with a private setting of
`core.fsync=none` to speed up cases where I'm mostly working with
other people's changes that are backed up in email or server-side
repos.  The latter setting conveniently overrides the former setting
in all aspects.

Also, with the core.fsync and core.fsyncExcludes, how would you spell
"don't sync anything"? Would you still have the aggregate options.?

> > This also supports the common request of doing absolutely no
> > fysncing with the `core.fsync=none` value, which is expected
> > to make the test suite faster.
>
> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
> so we'll accept "false", "off", "no" like most other such config?

Junio's previous feedback when discussing batch mode [1] was to offer
less flexibility when parsing new values of these configuration
options. I agree with his statement that "making machine-readable
tokens be spelled in different ways is a 'disease'."  I'd like to
leave this as-is so that the documentation can clearly state the exact
set of allowable values.

[1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/

>
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  Documentation/config/core.txt | 27 +++++++++----
> >  builtin/fast-import.c         |  2 +-
> >  builtin/index-pack.c          |  4 +-
> >  builtin/pack-objects.c        |  8 ++--
> >  bulk-checkin.c                |  5 ++-
> >  cache.h                       | 39 +++++++++++++++++-
> >  commit-graph.c                |  3 +-
> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
> >  csum-file.c                   |  5 ++-
> >  csum-file.h                   |  3 +-
> >  environment.c                 |  1 +
> >  midx.c                        |  3 +-
> >  object-file.c                 |  3 +-
> >  pack-bitmap-write.c           |  3 +-
> >  pack-write.c                  | 13 +++---
> >  read-cache.c                  |  2 +-
> >  16 files changed, 164 insertions(+), 33 deletions(-)
> >
> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> > index dbb134f7136..4f1747ec871 100644
> > --- a/Documentation/config/core.txt
> > +++ b/Documentation/config/core.txt
> > @@ -547,6 +547,25 @@ core.whitespace::
> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
> >    errors. The default tab width is 8. Allowed values are 1 to 63.
> >
> > +core.fsync::
> > +     A comma-separated list of parts of the repository which should be
> > +     hardened via the core.fsyncMethod when created or modified. You can
> > +     disable hardening of any component by prefixing it with a '-'. Later
> > +     items take precedence over earlier ones in the list. For example,
> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
> > +     metadata." Items that are not hardened may be lost in the event of an
> > +     unclean system shutdown.
> > ++
> > +* `none` disables fsync completely. This must be specified alone.
> > +* `loose-object` hardens objects added to the repo in loose-object form.
> > +* `pack` hardens objects added to the repo in packfile form.
> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> > +* `commit-graph` hardens the commit graph file.
> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
> > +  `pack-metadata`, and `commit-graph`.
> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
> > +* `all` is an aggregate option that syncs all individual components above.
> > +
>
> It's probably a *bit* more work to set up, but I wonder if this wouldn't
> be simpler if we just said (and this is partially going against what I
> noted above):
>
> == BEGIN DOC
>
> core.fsync is a multi-value config variable where each item is a
> pathspec that'll get matched the same way as 'git-ls-files' et al.
>
> When we sync pretend that a path like .git/objects/de/adbeef... is
> relative to the top-level of the git
> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
>
> You can then supply a list of wildcards and exclusions to configure
> syncing.  or "false", "off" etc. to turn it off. These are synonymous
> with:
>
>     ; same as "false"
>     core.fsync = ":!*"
>
> Or:
>
>     ; same as "true"
>     core.fsync = "*"
>
> Or, to selectively sync some things and not others:
>
>     ;; Sync objects, but not "info"
>     core.fsync = ":!objects/info/**"
>     core.fsync = "objects/**"
>
> See gitrepository-layout(5) for details about what sort of paths you
> might be expected to match. Not all paths listed there will go through
> this mechanism (e.g. currently objects do, but nothing to do with config
> does).
>
> We can and will match this against "fake paths", e.g. when writing out
> packs we may match against just the string "objects/pack", we're not
> going to re-check if every packfile we're writing matches your globs,
> ditto for loose objects. Be reasonable!
>
> This metharism is intended as a shorthand that provides some flexibility
> when fsyncing, while not forcing git to come up with labels for all
> paths the git dir, or to support crazyness like "objects/de/adbeef*"
>
> More paths may be added or removed in the future, and we make no
> promises that we won't move things around, so if in doubt use
> e.g. "true" or a wide pattern match like "objects/**". When in doubt
> stick to the golden path of examples provided in this documentation.
>
> == END DOC
>
>
> It's a tad more complex to set up, but I wonder if that isn't worth
> it. It nicely gets around any current and future issues of deciding what
> labels such as "loose-object" etc. to pick, as well as slotting into an
> existing method of doing exclude/include lists.
>

I think this proposal is a lot of complexity to avoid coming up with a
new name for syncable things as they are added to Git.  A path based
mechanism makes it hard to document for the (advanced) user what the
full set of things is and how it might change from release to release.
I think the current core.fsync scheme is a bit easier to understand,
query, and extend.

> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 857be7826f3..916c55d6ce9 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
> >                * If so, rewrite it like in fast-import
> >                */
> >               if (pack_to_stdout) {
> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>
> Not really related to this per-se, but since you're touching the API
> everything goes through I wonder if callers should just always try to
> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
> tries to flush stdout, or catch the fd at that lower level.
>
> Or maybe there's a good reason for this...

It's platform dependent, but I'd expect fsync would do something for
pipes or stdout redirected to a file.  In these cases we really don't
want to fsync since we have no idea what we're talking to and we're
potentially worsening performance for probably no benefit.

> > [...]
> > +/*
> > + * These values are used to help identify parts of a repository to fsync.
> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
> > + * repository and so shouldn't be fsynced.
> > + */
> > +enum fsync_component {
> > +     FSYNC_COMPONENT_NONE                    = 0,
>
> I haven't read ahead much but in most other such cases we don't define
> the "= 0", just start at 1<<0, then check the flags elsewhere...
>
> > +static const struct fsync_component_entry {
> > +     const char *name;
> > +     enum fsync_component component_bits;
> > +} fsync_component_table[] = {
> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> > +     { "pack", FSYNC_COMPONENT_PACK },
> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
> > +     { "all", FSYNC_COMPONENTS_ALL },
> > +};
> > +
> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
> > +{
> > +     enum fsync_component output = 0;
> > +
> > +     if (!strcmp(string, "none"))
> > +             return output;
> > +
> > +     while (string) {
> > +             int i;
> > +             size_t len;
> > +             const char *ep;
> > +             int negated = 0;
> > +             int found = 0;
> > +
> > +             string = string + strspn(string, ", \t\n\r");
>
> Aside from the "use a list" isn't this hardcoding some windows-specific
> assumptions with \n\r? Maybe not...

I shamelessly stole this code from parse_whitespace_rule. I thought
about making a helper to be called by both functions, but the amount
of state going into and out of the wrapper via arguments was
substantial and seemed to negate the benefit of deduplication.

Thanks for the review,
Neeraj

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Dec 07 2021, Neeraj Singh wrote:

> On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>>
>> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
>>
>> > From: Neeraj Singh <neerajsi@microsoft.com>
>> >
>> > This commit introduces the `core.fsync` configuration
>> > knob which can be used to control how components of the
>> > repository are made durable on disk.
>> >
>> > This setting allows future extensibility of the list of
>> > syncable components:
>> > * We issue a warning rather than an error for unrecognized
>> >   components, so new configs can be used with old Git versions.
>>
>> Looks good!
>>
>> > * We support negation, so users can choose one of the default
>> >   aggregate options and then remove components that they don't
>> >   want. The user would then harden any new components added in
>> >   a Git version update.
>>
>> I think this config schema makes sense, but just a (I think important)
>> comment on the "how" not "what" of it. It's really much better to define
>> config as:
>>
>>     [some]
>>     key = value
>>     key = value2
>>
>> Than:
>>
>>     [some]
>>     key = value,value2
>>
>> The reason is that "git config" has good support for working with
>> multi-valued stuff, so you can do e.g.:
>>
>>     git config --get-all -z some.key
>>
>> And you can easily (re)set such config e.g. with --replace-all etc., but
>> for comma-delimited you (and users) need to do all that work themselves.
>>
>> Similarly instead of:
>>
>>     some.key = want-this
>>     some.key = -not-this
>>     some.key = but-want-this
>>
>> I think it's better to just have two lists, one inclusive another
>> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
>> "transfer.hideRefs"
>>
>> Which would mean:
>>
>>     core.fsync = want-this
>>     core.fsyncExcludes = -not-this
>>
>> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
>> suggestion on making this easier for users & the implementation.
>>
>
> Maybe there's some way to handle this I'm unaware of, but a
> disadvantage of your multi-valued config proposal is that it's harder,
> for example, for a per-repo config store to reasonably override a
> per-user config store.  With the configuration scheme as-is, I can
> have a per-user setting like `core.fsync=all` which covers my typical
> repos, but then have a maintainer repo with a private setting of
> `core.fsync=none` to speed up cases where I'm mostly working with
> other people's changes that are backed up in email or server-side
> repos.  The latter setting conveniently overrides the former setting
> in all aspects.

Even if you turn just your comma-delimited proposal into a list proposal
can't we just say that the last one wins? Then it won't matter what cmae
before, you'd specify "core.fsync=none" in your local .git/config.

But this is also a general issue with a bunch of things in git's config
space. I'd rather see us use the list-based values and just come up with
some general way to reset them that works for all keys, rather than
regretting having comma-delimited values that'll be harder to work with
& parse, which will be a legacy wart if/when we come up with a way to
say "reset all previous settings".

> Also, with the core.fsync and core.fsyncExcludes, how would you spell
> "don't sync anything"? Would you still have the aggregate options.?

    core.fsyncExcludes = *

I.e. the same as the core.fsync=none above, anyway I still like the
wildcard thing below a bit more...

>> > This also supports the common request of doing absolutely no
>> > fysncing with the `core.fsync=none` value, which is expected
>> > to make the test suite faster.
>>
>> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
>> so we'll accept "false", "off", "no" like most other such config?
>
> Junio's previous feedback when discussing batch mode [1] was to offer
> less flexibility when parsing new values of these configuration
> options. I agree with his statement that "making machine-readable
> tokens be spelled in different ways is a 'disease'."  I'd like to
> leave this as-is so that the documentation can clearly state the exact
> set of allowable values.
>
> [1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/

I think he's talking about batch, Batch, BATCH, bAtCh etc. there. But
the "maybe bool" is a stanard pattern we use.

I don't think we'd call one of these 0, off, no or false etc. to avoid
confusion, so then you can use git_parse_maybe_...()

>>
>> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
>> > ---
>> >  Documentation/config/core.txt | 27 +++++++++----
>> >  builtin/fast-import.c         |  2 +-
>> >  builtin/index-pack.c          |  4 +-
>> >  builtin/pack-objects.c        |  8 ++--
>> >  bulk-checkin.c                |  5 ++-
>> >  cache.h                       | 39 +++++++++++++++++-
>> >  commit-graph.c                |  3 +-
>> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
>> >  csum-file.c                   |  5 ++-
>> >  csum-file.h                   |  3 +-
>> >  environment.c                 |  1 +
>> >  midx.c                        |  3 +-
>> >  object-file.c                 |  3 +-
>> >  pack-bitmap-write.c           |  3 +-
>> >  pack-write.c                  | 13 +++---
>> >  read-cache.c                  |  2 +-
>> >  16 files changed, 164 insertions(+), 33 deletions(-)
>> >
>> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
>> > index dbb134f7136..4f1747ec871 100644
>> > --- a/Documentation/config/core.txt
>> > +++ b/Documentation/config/core.txt
>> > @@ -547,6 +547,25 @@ core.whitespace::
>> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
>> >    errors. The default tab width is 8. Allowed values are 1 to 63.
>> >
>> > +core.fsync::
>> > +     A comma-separated list of parts of the repository which should be
>> > +     hardened via the core.fsyncMethod when created or modified. You can
>> > +     disable hardening of any component by prefixing it with a '-'. Later
>> > +     items take precedence over earlier ones in the list. For example,
>> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
>> > +     metadata." Items that are not hardened may be lost in the event of an
>> > +     unclean system shutdown.
>> > ++
>> > +* `none` disables fsync completely. This must be specified alone.
>> > +* `loose-object` hardens objects added to the repo in loose-object form.
>> > +* `pack` hardens objects added to the repo in packfile form.
>> > +* `pack-metadata` hardens packfile bitmaps and indexes.
>> > +* `commit-graph` hardens the commit graph file.
>> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
>> > +  `pack-metadata`, and `commit-graph`.
>> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
>> > +* `all` is an aggregate option that syncs all individual components above.
>> > +
>>
>> It's probably a *bit* more work to set up, but I wonder if this wouldn't
>> be simpler if we just said (and this is partially going against what I
>> noted above):
>>
>> == BEGIN DOC
>>
>> core.fsync is a multi-value config variable where each item is a
>> pathspec that'll get matched the same way as 'git-ls-files' et al.
>>
>> When we sync pretend that a path like .git/objects/de/adbeef... is
>> relative to the top-level of the git
>> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
>>
>> You can then supply a list of wildcards and exclusions to configure
>> syncing.  or "false", "off" etc. to turn it off. These are synonymous
>> with:
>>
>>     ; same as "false"
>>     core.fsync = ":!*"
>>
>> Or:
>>
>>     ; same as "true"
>>     core.fsync = "*"
>>
>> Or, to selectively sync some things and not others:
>>
>>     ;; Sync objects, but not "info"
>>     core.fsync = ":!objects/info/**"
>>     core.fsync = "objects/**"
>>
>> See gitrepository-layout(5) for details about what sort of paths you
>> might be expected to match. Not all paths listed there will go through
>> this mechanism (e.g. currently objects do, but nothing to do with config
>> does).
>>
>> We can and will match this against "fake paths", e.g. when writing out
>> packs we may match against just the string "objects/pack", we're not
>> going to re-check if every packfile we're writing matches your globs,
>> ditto for loose objects. Be reasonable!
>>
>> This metharism is intended as a shorthand that provides some flexibility
>> when fsyncing, while not forcing git to come up with labels for all
>> paths the git dir, or to support crazyness like "objects/de/adbeef*"
>>
>> More paths may be added or removed in the future, and we make no
>> promises that we won't move things around, so if in doubt use
>> e.g. "true" or a wide pattern match like "objects/**". When in doubt
>> stick to the golden path of examples provided in this documentation.
>>
>> == END DOC
>>
>>
>> It's a tad more complex to set up, but I wonder if that isn't worth
>> it. It nicely gets around any current and future issues of deciding what
>> labels such as "loose-object" etc. to pick, as well as slotting into an
>> existing method of doing exclude/include lists.
>>
>
> I think this proposal is a lot of complexity to avoid coming up with a
> new name for syncable things as they are added to Git.  A path based
> mechanism makes it hard to document for the (advanced) user what the
> full set of things is and how it might change from release to release.
> I think the current core.fsync scheme is a bit easier to understand,
> query, and extend.

We document it in gitrepository-layout(5). Yeah it has some
disadvantages, but one advantage is that you could make the
composability easy. I.e. if last exclude wins then a setting of:

    core.fsync = ":!*"
    core.fsync = "objects/**"

Would reset all previous matches & only match objects/**.

>> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> > index 857be7826f3..916c55d6ce9 100644
>> > --- a/builtin/pack-objects.c
>> > +++ b/builtin/pack-objects.c
>> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
>> >                * If so, rewrite it like in fast-import
>> >                */
>> >               if (pack_to_stdout) {
>> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
>> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>>
>> Not really related to this per-se, but since you're touching the API
>> everything goes through I wonder if callers should just always try to
>> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
>> tries to flush stdout, or catch the fd at that lower level.
>>
>> Or maybe there's a good reason for this...
>
> It's platform dependent, but I'd expect fsync would do something for
> pipes or stdout redirected to a file.  In these cases we really don't
> want to fsync since we have no idea what we're talking to and we're
> potentially worsening performance for probably no benefit.

Yeah maybe we should just leave it be.

I'd think the C library returning EINVAL would be a trivial performance
cost though.

It just seemed odd to hardcode assumptions about what can and can't be
synced when the POSIX defined function will also tell us that.

Anyway...

>> > [...]
>> > +/*
>> > + * These values are used to help identify parts of a repository to fsync.
>> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
>> > + * repository and so shouldn't be fsynced.
>> > + */
>> > +enum fsync_component {
>> > +     FSYNC_COMPONENT_NONE                    = 0,
>>
>> I haven't read ahead much but in most other such cases we don't define
>> the "= 0", just start at 1<<0, then check the flags elsewhere...
>>
>> > +static const struct fsync_component_entry {
>> > +     const char *name;
>> > +     enum fsync_component component_bits;
>> > +} fsync_component_table[] = {
>> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
>> > +     { "pack", FSYNC_COMPONENT_PACK },
>> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
>> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
>> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
>> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
>> > +     { "all", FSYNC_COMPONENTS_ALL },
>> > +};
>> > +
>> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
>> > +{
>> > +     enum fsync_component output = 0;
>> > +
>> > +     if (!strcmp(string, "none"))
>> > +             return output;
>> > +
>> > +     while (string) {
>> > +             int i;
>> > +             size_t len;
>> > +             const char *ep;
>> > +             int negated = 0;
>> > +             int found = 0;
>> > +
>> > +             string = string + strspn(string, ", \t\n\r");
>>
>> Aside from the "use a list" isn't this hardcoding some windows-specific
>> assumptions with \n\r? Maybe not...
>
> I shamelessly stole this code from parse_whitespace_rule. I thought
> about making a helper to be called by both functions, but the amount
> of state going into and out of the wrapper via arguments was
> substantial and seemed to negate the benefit of deduplication.

FWIW string_list_split() is easier to work with in those cases, or at
least I think so...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Dec 8, 2021 at 2:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Tue, Dec 07 2021, Neeraj Singh wrote:
>
> > On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >>
> >>
> >> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
> >>
> >> > From: Neeraj Singh <neerajsi@microsoft.com>
> >> >
> >> > This commit introduces the `core.fsync` configuration
> >> > knob which can be used to control how components of the
> >> > repository are made durable on disk.
> >> >
> >> > This setting allows future extensibility of the list of
> >> > syncable components:
> >> > * We issue a warning rather than an error for unrecognized
> >> >   components, so new configs can be used with old Git versions.
> >>
> >> Looks good!
> >>
> >> > * We support negation, so users can choose one of the default
> >> >   aggregate options and then remove components that they don't
> >> >   want. The user would then harden any new components added in
> >> >   a Git version update.
> >>
> >> I think this config schema makes sense, but just a (I think important)
> >> comment on the "how" not "what" of it. It's really much better to define
> >> config as:
> >>
> >>     [some]
> >>     key = value
> >>     key = value2
> >>
> >> Than:
> >>
> >>     [some]
> >>     key = value,value2
> >>
> >> The reason is that "git config" has good support for working with
> >> multi-valued stuff, so you can do e.g.:
> >>
> >>     git config --get-all -z some.key
> >>
> >> And you can easily (re)set such config e.g. with --replace-all etc., but
> >> for comma-delimited you (and users) need to do all that work themselves.
> >>
> >> Similarly instead of:
> >>
> >>     some.key = want-this
> >>     some.key = -not-this
> >>     some.key = but-want-this
> >>
> >> I think it's better to just have two lists, one inclusive another
> >> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
> >> "transfer.hideRefs"
> >>
> >> Which would mean:
> >>
> >>     core.fsync = want-this
> >>     core.fsyncExcludes = -not-this
> >>
> >> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
> >> suggestion on making this easier for users & the implementation.
> >>
> >
> > Maybe there's some way to handle this I'm unaware of, but a
> > disadvantage of your multi-valued config proposal is that it's harder,
> > for example, for a per-repo config store to reasonably override a
> > per-user config store.  With the configuration scheme as-is, I can
> > have a per-user setting like `core.fsync=all` which covers my typical
> > repos, but then have a maintainer repo with a private setting of
> > `core.fsync=none` to speed up cases where I'm mostly working with
> > other people's changes that are backed up in email or server-side
> > repos.  The latter setting conveniently overrides the former setting
> > in all aspects.
>
> Even if you turn just your comma-delimited proposal into a list proposal
> can't we just say that the last one wins? Then it won't matter what cmae
> before, you'd specify "core.fsync=none" in your local .git/config.
>
> But this is also a general issue with a bunch of things in git's config
> space. I'd rather see us use the list-based values and just come up with
> some general way to reset them that works for all keys, rather than
> regretting having comma-delimited values that'll be harder to work with
> & parse, which will be a legacy wart if/when we come up with a way to
> say "reset all previous settings".
>
> > Also, with the core.fsync and core.fsyncExcludes, how would you spell
> > "don't sync anything"? Would you still have the aggregate options.?
>
>     core.fsyncExcludes = *
>
> I.e. the same as the core.fsync=none above, anyway I still like the
> wildcard thing below a bit more...

I'm not going to take this feedback unless there are additional votes
from the Git community in this direction.  I make the claim that
single-valued comma-separated config lists are easier to work with in
the existing Git infrastructure.  We already use essentially the same
parsing code for the core.whitespace variable and users are used to
this syntax there. There are several other comma-separated lists in
the config space, so this construct has precedence and will be with
Git for some time.  Also, fsync configurations aren't composable like
some other configurations may be. It makes sense to have a holistic
singular fsync configuration, which is best represented by a single
variable.

> >> > This also supports the common request of doing absolutely no
> >> > fysncing with the `core.fsync=none` value, which is expected
> >> > to make the test suite faster.
> >>
> >> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
> >> so we'll accept "false", "off", "no" like most other such config?
> >
> > Junio's previous feedback when discussing batch mode [1] was to offer
> > less flexibility when parsing new values of these configuration
> > options. I agree with his statement that "making machine-readable
> > tokens be spelled in different ways is a 'disease'."  I'd like to
> > leave this as-is so that the documentation can clearly state the exact
> > set of allowable values.
> >
> > [1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/
>
> I think he's talking about batch, Batch, BATCH, bAtCh etc. there. But
> the "maybe bool" is a stanard pattern we use.
>
> I don't think we'd call one of these 0, off, no or false etc. to avoid
> confusion, so then you can use git_parse_maybe_...()

I don't see the advantage of having multiple ways of specifying
"none".  The user can read the doc and know exactly what to write.  If
they write something unallowable, they get a clear warning and they
can read the doc again to figure out what to write.  This isn't a
boolean options at all, so why should we entertain bool-like ways of
spelling it?

> >>
> >> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> >> > ---
> >> >  Documentation/config/core.txt | 27 +++++++++----
> >> >  builtin/fast-import.c         |  2 +-
> >> >  builtin/index-pack.c          |  4 +-
> >> >  builtin/pack-objects.c        |  8 ++--
> >> >  bulk-checkin.c                |  5 ++-
> >> >  cache.h                       | 39 +++++++++++++++++-
> >> >  commit-graph.c                |  3 +-
> >> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
> >> >  csum-file.c                   |  5 ++-
> >> >  csum-file.h                   |  3 +-
> >> >  environment.c                 |  1 +
> >> >  midx.c                        |  3 +-
> >> >  object-file.c                 |  3 +-
> >> >  pack-bitmap-write.c           |  3 +-
> >> >  pack-write.c                  | 13 +++---
> >> >  read-cache.c                  |  2 +-
> >> >  16 files changed, 164 insertions(+), 33 deletions(-)
> >> >
> >> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> >> > index dbb134f7136..4f1747ec871 100644
> >> > --- a/Documentation/config/core.txt
> >> > +++ b/Documentation/config/core.txt
> >> > @@ -547,6 +547,25 @@ core.whitespace::
> >> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
> >> >    errors. The default tab width is 8. Allowed values are 1 to 63.
> >> >
> >> > +core.fsync::
> >> > +     A comma-separated list of parts of the repository which should be
> >> > +     hardened via the core.fsyncMethod when created or modified. You can
> >> > +     disable hardening of any component by prefixing it with a '-'. Later
> >> > +     items take precedence over earlier ones in the list. For example,
> >> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
> >> > +     metadata." Items that are not hardened may be lost in the event of an
> >> > +     unclean system shutdown.
> >> > ++
> >> > +* `none` disables fsync completely. This must be specified alone.
> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
> >> > +* `pack` hardens objects added to the repo in packfile form.
> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> >> > +* `commit-graph` hardens the commit graph file.
> >> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
> >> > +  `pack-metadata`, and `commit-graph`.
> >> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
> >> > +* `all` is an aggregate option that syncs all individual components above.
> >> > +
> >>
> >> It's probably a *bit* more work to set up, but I wonder if this wouldn't
> >> be simpler if we just said (and this is partially going against what I
> >> noted above):
> >>
> >> == BEGIN DOC
> >>
> >> core.fsync is a multi-value config variable where each item is a
> >> pathspec that'll get matched the same way as 'git-ls-files' et al.
> >>
> >> When we sync pretend that a path like .git/objects/de/adbeef... is
> >> relative to the top-level of the git
> >> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
> >>
> >> You can then supply a list of wildcards and exclusions to configure
> >> syncing.  or "false", "off" etc. to turn it off. These are synonymous
> >> with:
> >>
> >>     ; same as "false"
> >>     core.fsync = ":!*"
> >>
> >> Or:
> >>
> >>     ; same as "true"
> >>     core.fsync = "*"
> >>
> >> Or, to selectively sync some things and not others:
> >>
> >>     ;; Sync objects, but not "info"
> >>     core.fsync = ":!objects/info/**"
> >>     core.fsync = "objects/**"
> >>
> >> See gitrepository-layout(5) for details about what sort of paths you
> >> might be expected to match. Not all paths listed there will go through
> >> this mechanism (e.g. currently objects do, but nothing to do with config
> >> does).
> >>
> >> We can and will match this against "fake paths", e.g. when writing out
> >> packs we may match against just the string "objects/pack", we're not
> >> going to re-check if every packfile we're writing matches your globs,
> >> ditto for loose objects. Be reasonable!
> >>
> >> This metharism is intended as a shorthand that provides some flexibility
> >> when fsyncing, while not forcing git to come up with labels for all
> >> paths the git dir, or to support crazyness like "objects/de/adbeef*"
> >>
> >> More paths may be added or removed in the future, and we make no
> >> promises that we won't move things around, so if in doubt use
> >> e.g. "true" or a wide pattern match like "objects/**". When in doubt
> >> stick to the golden path of examples provided in this documentation.
> >>
> >> == END DOC
> >>
> >>
> >> It's a tad more complex to set up, but I wonder if that isn't worth
> >> it. It nicely gets around any current and future issues of deciding what
> >> labels such as "loose-object" etc. to pick, as well as slotting into an
> >> existing method of doing exclude/include lists.
> >>
> >
> > I think this proposal is a lot of complexity to avoid coming up with a
> > new name for syncable things as they are added to Git.  A path based
> > mechanism makes it hard to document for the (advanced) user what the
> > full set of things is and how it might change from release to release.
> > I think the current core.fsync scheme is a bit easier to understand,
> > query, and extend.
>
> We document it in gitrepository-layout(5). Yeah it has some
> disadvantages, but one advantage is that you could make the
> composability easy. I.e. if last exclude wins then a setting of:
>
>     core.fsync = ":!*"
>     core.fsync = "objects/**"
>
> Would reset all previous matches & only match objects/**.
>

The value of changing this is predicated on taking your previous
multi-valued config proposal, which I'm still not at all convinced
about.  The schema in the current (v1-v2) version of the patch already
includes an example of extending the list of syncable things, and
Patrick Steinhardt made it clear that he feels comfortable adding
'refs' to the same schema in a future change.

I'll also emphasize that we're talking about a non-functional,
relatively corner-case behavioral configuration.  These values don't
change how git's interface behaves except when the system crashes
during a git command or shortly after one completes.

While you may not personally love the proposed configuration
interface, I'd want your view on some questions:
1. Is it easy for the (advanced) user to set a configuration?
2. Is it easy for the (advanced) user to see what was configured?
3. Is it easy for the Git community to build on this as we want to add
things to the list of things to sync?
    a) Is there a good best practice configuration so that people can
avoid losing integrity for new stuff that they are intending to sync.
    b) If someone has a custom configuration, can that custom
configuration do something reasonable as they upgrade versions of Git?
             ** In response to this question, I might see some value
in adding a 'derived-metadata' aggregate that can be disabled so that
a custom configuration can exclude those as they change version to
version.
    c) Is it too much maintenance overhead to consider how to present
this configuration knob for any new hashfile or other datafile in the
git repo?
4. Is there a good path forward to change the default syncable set,
both in git-for-windows and in Git for other platforms?

> >> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> >> > index 857be7826f3..916c55d6ce9 100644
> >> > --- a/builtin/pack-objects.c
> >> > +++ b/builtin/pack-objects.c
> >> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
> >> >                * If so, rewrite it like in fast-import
> >> >                */
> >> >               if (pack_to_stdout) {
> >> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> >> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> >> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> >>
> >> Not really related to this per-se, but since you're touching the API
> >> everything goes through I wonder if callers should just always try to
> >> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
> >> tries to flush stdout, or catch the fd at that lower level.
> >>
> >> Or maybe there's a good reason for this...
> >
> > It's platform dependent, but I'd expect fsync would do something for
> > pipes or stdout redirected to a file.  In these cases we really don't
> > want to fsync since we have no idea what we're talking to and we're
> > potentially worsening performance for probably no benefit.
>
> Yeah maybe we should just leave it be.
>
> I'd think the C library returning EINVAL would be a trivial performance
> cost though.
>
> It just seemed odd to hardcode assumptions about what can and can't be
> synced when the POSIX defined function will also tell us that.
>

Redirecting stdout to a file seems like a common usage for this
command. That would definitely be fsyncable, but Git has no idea what
its proper category is since there's no way to know the purpose or
lifetime of the packfile.  I'm going to leave this be, because I'd
posit that "can it be fsynced?" is not the same as "should it be
fsynced?".  The latter question can't be answered for stdout.

>
> >> > [...]
> >> > +/*
> >> > + * These values are used to help identify parts of a repository to fsync.
> >> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
> >> > + * repository and so shouldn't be fsynced.
> >> > + */
> >> > +enum fsync_component {
> >> > +     FSYNC_COMPONENT_NONE                    = 0,
> >>
> >> I haven't read ahead much but in most other such cases we don't define
> >> the "= 0", just start at 1<<0, then check the flags elsewhere...
> >>
> >> > +static const struct fsync_component_entry {
> >> > +     const char *name;
> >> > +     enum fsync_component component_bits;
> >> > +} fsync_component_table[] = {
> >> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> >> > +     { "pack", FSYNC_COMPONENT_PACK },
> >> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> >> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> >> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
> >> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
> >> > +     { "all", FSYNC_COMPONENTS_ALL },
> >> > +};
> >> > +
> >> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
> >> > +{
> >> > +     enum fsync_component output = 0;
> >> > +
> >> > +     if (!strcmp(string, "none"))
> >> > +             return output;
> >> > +
> >> > +     while (string) {
> >> > +             int i;
> >> > +             size_t len;
> >> > +             const char *ep;
> >> > +             int negated = 0;
> >> > +             int found = 0;
> >> > +
> >> > +             string = string + strspn(string, ", \t\n\r");
> >>
> >> Aside from the "use a list" isn't this hardcoding some windows-specific
> >> assumptions with \n\r? Maybe not...
> >
> > I shamelessly stole this code from parse_whitespace_rule. I thought
> > about making a helper to be called by both functions, but the amount
> > of state going into and out of the wrapper via arguments was
> > substantial and seemed to negate the benefit of deduplication.
>
> FWIW string_list_split() is easier to work with in those cases, or at
> least I think so...

This code runs at startup for a variable that may be present on some
installations.  The nice property of the current patch's code is that
it's already a well-tested pattern that doesn't do any allocations as
it's working, unlike string_list_split().

I hope you know that I appreciate your review feedback, even though
I'm pushing back on most of it so far this round. I'll be sending v3
to the list soon after giving it another look over.

Thanks,
Neeraj

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Neeraj Singh <nksingh85@gmail.com> writes:

> I'm not going to take this feedback unless there are additional votes
> from the Git community in this direction.  I make the claim that
> single-valued comma-separated config lists are easier to work with in
> the existing Git infrastructure.  We already use essentially the same
> parsing code for the core.whitespace variable and users are used to
> this syntax there. There are several other comma-separated lists in
> the config space, so this construct has precedence and will be with
> Git for some time.  Also, fsync configurations aren't composable like
> some other configurations may be. It makes sense to have a holistic
> singular fsync configuration, which is best represented by a single
> variable.

I haven't caught up with the discussion in this thread, even though
I have been meaning to think about it for some time---I just haven't
got around to it (sorry).  So I'll stop at giving a general guidance
and leave the decision if it applies to this particular discussion
to readers.

As the inventor of core.whitespace "list of values and its parser, I
am biased, but I would say that it works well for simple things that
do not need too much overriding.  The other side of the coin is that
it can become very awkward going forward if we use it to things that
have more complex needs than answering a simple question like "what
whitespace errors should be checked?".

More specifically, core.whitespace is pecuriar in a few ways.

 * It does follow the usual "the last one wins" rule, but in a
   strange way.  Notice the "unsigned rule = WS_DEFAULT_RULE"
   assignment at the beginning of ws.c::parse_whitespace_rule()?
   For each configuration "core.whitespace=<list>" we encounter,
   we start from the default, discarding everything we saw so far,
   and tweak that default value with tokens found on the list.

 * You cannot do "The system config gives one set of values, which
   you tweak with the personal config, which is further tweaked with
   the repository config" as the consequence.  This is blessing and
   is curse at the same time, as it makes inspection simpler when
   things go wrong (you only need to check whta the last one does),
   but it is harder to share the common things in more common file.

 * Its design relies on the choices being strings chosen from a
   fixed vocabulary to allow you to say "the value of the
   configuration variable is a list of tokens separated by a comma"
   and "the default has X bit set, but we disable that with -X".
   For a configuration variable whose value is an arbitrary string
   or a number, obviously that approach would not work.

If the need of the topic is simple enough that the above limitation
core.whitespace does not pose a problem going forward, it would be
fine, but we may regret the choice we make today if that is not the
case.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Dec 8, 2021 at 8:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> On Wed, Dec 08 2021, Neeraj Singh wrote:
>
> > On Wed, Dec 8, 2021 at 2:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >>
> >>
> >> On Tue, Dec 07 2021, Neeraj Singh wrote:
> >>
> >> > On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >> >>
> >> >>
> >> >> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
> >> >>
> >> >> > From: Neeraj Singh <neerajsi@microsoft.com>
> >> >> >
> >> >> > This commit introduces the `core.fsync` configuration
> >> >> > knob which can be used to control how components of the
> >> >> > repository are made durable on disk.
> >> >> >
> >> >> > This setting allows future extensibility of the list of
> >> >> > syncable components:
> >> >> > * We issue a warning rather than an error for unrecognized
> >> >> >   components, so new configs can be used with old Git versions.
> >> >>
> >> >> Looks good!
> >> >>
> >> >> > * We support negation, so users can choose one of the default
> >> >> >   aggregate options and then remove components that they don't
> >> >> >   want. The user would then harden any new components added in
> >> >> >   a Git version update.
> >> >>
> >> >> I think this config schema makes sense, but just a (I think important)
> >> >> comment on the "how" not "what" of it. It's really much better to define
> >> >> config as:
> >> >>
> >> >>     [some]
> >> >>     key = value
> >> >>     key = value2
> >> >>
> >> >> Than:
> >> >>
> >> >>     [some]
> >> >>     key = value,value2
> >> >>
> >> >> The reason is that "git config" has good support for working with
> >> >> multi-valued stuff, so you can do e.g.:
> >> >>
> >> >>     git config --get-all -z some.key
> >> >>
> >> >> And you can easily (re)set such config e.g. with --replace-all etc., but
> >> >> for comma-delimited you (and users) need to do all that work themselves.
> >> >>
> >> >> Similarly instead of:
> >> >>
> >> >>     some.key = want-this
> >> >>     some.key = -not-this
> >> >>     some.key = but-want-this
> >> >>
> >> >> I think it's better to just have two lists, one inclusive another
> >> >> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
> >> >> "transfer.hideRefs"
> >> >>
> >> >> Which would mean:
> >> >>
> >> >>     core.fsync = want-this
> >> >>     core.fsyncExcludes = -not-this
> >> >>
> >> >> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
> >> >> suggestion on making this easier for users & the implementation.
> >> >>
> >> >
> >> > Maybe there's some way to handle this I'm unaware of, but a
> >> > disadvantage of your multi-valued config proposal is that it's harder,
> >> > for example, for a per-repo config store to reasonably override a
> >> > per-user config store.  With the configuration scheme as-is, I can
> >> > have a per-user setting like `core.fsync=all` which covers my typical
> >> > repos, but then have a maintainer repo with a private setting of
> >> > `core.fsync=none` to speed up cases where I'm mostly working with
> >> > other people's changes that are backed up in email or server-side
> >> > repos.  The latter setting conveniently overrides the former setting
> >> > in all aspects.
> >>
> >> Even if you turn just your comma-delimited proposal into a list proposal
> >> can't we just say that the last one wins? Then it won't matter what cmae
> >> before, you'd specify "core.fsync=none" in your local .git/config.
> >>
> >> But this is also a general issue with a bunch of things in git's config
> >> space. I'd rather see us use the list-based values and just come up with
> >> some general way to reset them that works for all keys, rather than
> >> regretting having comma-delimited values that'll be harder to work with
> >> & parse, which will be a legacy wart if/when we come up with a way to
> >> say "reset all previous settings".
> >>
> >> > Also, with the core.fsync and core.fsyncExcludes, how would you spell
> >> > "don't sync anything"? Would you still have the aggregate options.?
> >>
> >>     core.fsyncExcludes = *
> >>
> >> I.e. the same as the core.fsync=none above, anyway I still like the
> >> wildcard thing below a bit more...
> >
> > I'm not going to take this feedback unless there are additional votes
> > from the Git community in this direction.  I make the claim that
> > single-valued comma-separated config lists are easier to work with in
> > the existing Git infrastructure.
>
> Easier in what sense? I showed examples of how "git-config" trivially
> works with multi-valued config, but for comma-delimited you'll need to
> write your own shellscript boilerplate around simple things like adding
> values, removing existing ones etc.
>
> I.e. you can use --add, --unset, --unset-all, --get, --get-all etc.
>

I see what you're saying for cases where someone would want to set a
core.fsync setting that's derived from the user's current config.  But
I'm guessing that the dominant use case is someone setting a new fsync
configuration that achieves some atomic goal with respect to a given
repository. Like "this is a throwaway, so sync nothing" or "this is
really important, so sync all objects and refs and the index".

> > parsing code for the core.whitespace variable and users are used to
> > this syntax there. There are several other comma-separated lists in
> > the config space, so this construct has precedence and will be with
> > Git for some time.
>
> That's not really an argument either way for why we'd pick X over Y for
> something new. We've got some comma-delimited, some multi-value (I'm
> fairly sure we have more multi-value).
>

My main point here is that there's precedent for patch's current exact
schema for a config in the same core config leaf of the Documentation.
It seems from your comments that we'd have to invent and document some
new convention for "reset" of a multi-valued config.  So you're
suggesting that I solve an extra set of problems to get this change
in.  Just want to remind you that my personal itch to scratch is to
get the underlying mechanism in so that git-for-windows can set its
default setting to batch mode. I'm not expecting many users to
actually configure this setting to any non-default value.

> > Also, fsync configurations aren't composable like
> > some other configurations may be. It makes sense to have a holistic
> > singular fsync configuration, which is best represented by a single
> > variable.
>
> What's a "variable" here? We call these "keys", you can have a
> single-value key like user.name that you get with --get, or a
> multi-value key like say branch.<name>.merge or push.pushOption that
> you'd get with --get-all.

Yeah, I meant "key".  I conflated the config key with the underlying
global variable in git.

> I think you may be referring to either not wanting these to be
> "inherited" (which is not really a think we do for anything else in
> config), or lacking the ability to "reset".
>
> For the latter if that's absolutely needed we could just use the same
> trick as "diff.wsErrorHighlight" uses of making an empty value "reset"
> the list, and you'd get better "git config" support for editing it.
>

My reading of the code is that diff.wsErrorHighlight is a comma
separated list and not a multi-valued config.  Actually I haven't yet
found an existing multi-valued config (not sure how to grep for it).

> >> >> > This also supports the common request of doing absolutely no
> >> >> > fysncing with the `core.fsync=none` value, which is expected
> >> >> > to make the test suite faster.
> >> >>
> >> >> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
> >> >> so we'll accept "false", "off", "no" like most other such config?
> >> >
> >> > Junio's previous feedback when discussing batch mode [1] was to offer
> >> > less flexibility when parsing new values of these configuration
> >> > options. I agree with his statement that "making machine-readable
> >> > tokens be spelled in different ways is a 'disease'."  I'd like to
> >> > leave this as-is so that the documentation can clearly state the exact
> >> > set of allowable values.
> >> >
> >> > [1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/
> >>
> >> I think he's talking about batch, Batch, BATCH, bAtCh etc. there. But
> >> the "maybe bool" is a stanard pattern we use.
> >>
> >> I don't think we'd call one of these 0, off, no or false etc. to avoid
> >> confusion, so then you can use git_parse_maybe_...()
> >
> > I don't see the advantage of having multiple ways of specifying
> > "none".  The user can read the doc and know exactly what to write.  If
> > they write something unallowable, they get a clear warning and they
> > can read the doc again to figure out what to write.  This isn't a
> > boolean options at all, so why should we entertain bool-like ways of
> > spelling it?
>
> It's not boolean, it's multi-value and one of the values includes a true
> or false boolean value. You just spell it "none".
>
> I think both this and your comment above suggest that you think there's
> no point in this because you haven't interacted with/used "git config"
> as a command line or API mechanism, but have just hand-crafted config
> files.
>
> That's fair enough, but there's a lot of tooling that benefits from the
> latter.

My batch mode perf tests (on github, not yet submitted to the list)
use `git -c core.fsync=<foo>` to set up a per-process config.  I
haven't used the `git config` writing support in a while, so I haven't
deeply thought about that.  However, I favor simplifying the use case
of "atomically" setting a new holistic core.fsync value versus the use
case of deriving a new core.fsync value from the preexisting value.

> E.g.:
>
>     $ git -c core.fsync=off config --type=bool core.fsync
>     false
>     $ git -c core.fsync=blah config --type=bool core.fsync
>     fatal: bad boolean config value 'blah' for 'core.fsync'
>
> Here we can get 'git config' to normalize what you call 'none', and you
> can tell via exit codes/normalization if it's "false". But if you invent
> a new term for "false" you can't do that as easily.
>
> We have various historical keys that take odd exceptions to that,
> e.g. "never", but unless we have a good reason to let's not invent more
> exceptions.
>
> >> >> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> >> >> > ---
> >> >> >  Documentation/config/core.txt | 27 +++++++++----
> >> >> >  builtin/fast-import.c         |  2 +-
> >> >> >  builtin/index-pack.c          |  4 +-
> >> >> >  builtin/pack-objects.c        |  8 ++--
> >> >> >  bulk-checkin.c                |  5 ++-
> >> >> >  cache.h                       | 39 +++++++++++++++++-
> >> >> >  commit-graph.c                |  3 +-
> >> >> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
> >> >> >  csum-file.c                   |  5 ++-
> >> >> >  csum-file.h                   |  3 +-
> >> >> >  environment.c                 |  1 +
> >> >> >  midx.c                        |  3 +-
> >> >> >  object-file.c                 |  3 +-
> >> >> >  pack-bitmap-write.c           |  3 +-
> >> >> >  pack-write.c                  | 13 +++---
> >> >> >  read-cache.c                  |  2 +-
> >> >> >  16 files changed, 164 insertions(+), 33 deletions(-)
> >> >> >
> >> >> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> >> >> > index dbb134f7136..4f1747ec871 100644
> >> >> > --- a/Documentation/config/core.txt
> >> >> > +++ b/Documentation/config/core.txt
> >> >> > @@ -547,6 +547,25 @@ core.whitespace::
> >> >> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
> >> >> >    errors. The default tab width is 8. Allowed values are 1 to 63.
> >> >> >
> >> >> > +core.fsync::
> >> >> > +     A comma-separated list of parts of the repository which should be
> >> >> > +     hardened via the core.fsyncMethod when created or modified. You can
> >> >> > +     disable hardening of any component by prefixing it with a '-'. Later
> >> >> > +     items take precedence over earlier ones in the list. For example,
> >> >> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
> >> >> > +     metadata." Items that are not hardened may be lost in the event of an
> >> >> > +     unclean system shutdown.
> >> >> > ++
> >> >> > +* `none` disables fsync completely. This must be specified alone.
> >> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
> >> >> > +* `pack` hardens objects added to the repo in packfile form.
> >> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> >> >> > +* `commit-graph` hardens the commit graph file.
> >> >> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
> >> >> > +  `pack-metadata`, and `commit-graph`.
> >> >> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
> >> >> > +* `all` is an aggregate option that syncs all individual components above.
> >> >> > +
> >> >>
> >> >> It's probably a *bit* more work to set up, but I wonder if this wouldn't
> >> >> be simpler if we just said (and this is partially going against what I
> >> >> noted above):
> >> >>
> >> >> == BEGIN DOC
> >> >>
> >> >> core.fsync is a multi-value config variable where each item is a
> >> >> pathspec that'll get matched the same way as 'git-ls-files' et al.
> >> >>
> >> >> When we sync pretend that a path like .git/objects/de/adbeef... is
> >> >> relative to the top-level of the git
> >> >> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
> >> >>
> >> >> You can then supply a list of wildcards and exclusions to configure
> >> >> syncing.  or "false", "off" etc. to turn it off. These are synonymous
> >> >> with:
> >> >>
> >> >>     ; same as "false"
> >> >>     core.fsync = ":!*"
> >> >>
> >> >> Or:
> >> >>
> >> >>     ; same as "true"
> >> >>     core.fsync = "*"
> >> >>
> >> >> Or, to selectively sync some things and not others:
> >> >>
> >> >>     ;; Sync objects, but not "info"
> >> >>     core.fsync = ":!objects/info/**"
> >> >>     core.fsync = "objects/**"
> >> >>
> >> >> See gitrepository-layout(5) for details about what sort of paths you
> >> >> might be expected to match. Not all paths listed there will go through
> >> >> this mechanism (e.g. currently objects do, but nothing to do with config
> >> >> does).
> >> >>
> >> >> We can and will match this against "fake paths", e.g. when writing out
> >> >> packs we may match against just the string "objects/pack", we're not
> >> >> going to re-check if every packfile we're writing matches your globs,
> >> >> ditto for loose objects. Be reasonable!
> >> >>
> >> >> This metharism is intended as a shorthand that provides some flexibility
> >> >> when fsyncing, while not forcing git to come up with labels for all
> >> >> paths the git dir, or to support crazyness like "objects/de/adbeef*"
> >> >>
> >> >> More paths may be added or removed in the future, and we make no
> >> >> promises that we won't move things around, so if in doubt use
> >> >> e.g. "true" or a wide pattern match like "objects/**". When in doubt
> >> >> stick to the golden path of examples provided in this documentation.
> >> >>
> >> >> == END DOC
> >> >>
> >> >>
> >> >> It's a tad more complex to set up, but I wonder if that isn't worth
> >> >> it. It nicely gets around any current and future issues of deciding what
> >> >> labels such as "loose-object" etc. to pick, as well as slotting into an
> >> >> existing method of doing exclude/include lists.
> >> >>
> >> >
> >> > I think this proposal is a lot of complexity to avoid coming up with a
> >> > new name for syncable things as they are added to Git.  A path based
> >> > mechanism makes it hard to document for the (advanced) user what the
> >> > full set of things is and how it might change from release to release.
> >> > I think the current core.fsync scheme is a bit easier to understand,
> >> > query, and extend.
> >>
> >> We document it in gitrepository-layout(5). Yeah it has some
> >> disadvantages, but one advantage is that you could make the
> >> composability easy. I.e. if last exclude wins then a setting of:
> >>
> >>     core.fsync = ":!*"
> >>     core.fsync = "objects/**"
> >>
> >> Would reset all previous matches & only match objects/**.
> >>
> >
> > The value of changing this is predicated on taking your previous
> > multi-valued config proposal, which I'm still not at all convinced
> > about.
>
> They're orthagonal. I.e. you get benefits from multi-value with or
> without this globbing mechanism.
>
> In any case, I don't feel strongly about/am really advocating this
> globbing mechanism. I just wondered if it wouldn't make things simpler
> since it would sidestep the need to create any sort of categories for
> subsets of gitrepository-layout(5), but maybe not...
>
> > The schema in the current (v1-v2) version of the patch already
> > includes an example of extending the list of syncable things, and
> > Patrick Steinhardt made it clear that he feels comfortable adding
> > 'refs' to the same schema in a future change.
> >
> > I'll also emphasize that we're talking about a non-functional,
> > relatively corner-case behavioral configuration.  These values don't
> > change how git's interface behaves except when the system crashes
> > during a git command or shortly after one completes.
>
> That may be something some OS's promise, but it's not something fsync()
> or POSIX promises. I.e. you might write a ref, but unless you fsync and
> the relevant dir entries another process might not see the update, crash
> or not.
>

I haven't seen any indication that POSIX requires an fsync for
visiblity within a running system.  I looked at the doc for open,
write, and fsync, and saw no indication that it's posix compliant to
require an fsync for visibility.  I think an OS that required fsync
for cross-process visiblity would fail to run Git for a myriad of
other reasons and would likely lose all its users.  I'm curious where
you've seen documentation that allows such unhelpful behavior?

> That's an aside from these config design questions, and I think
> most/(all?) OS's anyone cares about these days tend to make that
> implicit promise as part of their VFS behavior, but we should probably
> design such an interface to fsync() with such pedantic portability in
> mind.

Why? To be rude to such a hypothetical system, if a system were so
insanely designed, it would be nuts to support it.

> > While you may not personally love the proposed configuration
> > interface, I'd want your view on some questions:
> > 1. Is it easy for the (advanced) user to set a configuration?
> > 2. Is it easy for the (advanced) user to see what was configured?
> > 3. Is it easy for the Git community to build on this as we want to add
> > things to the list of things to sync?
> >     a) Is there a good best practice configuration so that people can
> > avoid losing integrity for new stuff that they are intending to sync.
> >     b) If someone has a custom configuration, can that custom
> > configuration do something reasonable as they upgrade versions of Git?
> >              ** In response to this question, I might see some value
> > in adding a 'derived-metadata' aggregate that can be disabled so that
> > a custom configuration can exclude those as they change version to
> > version.
> >     c) Is it too much maintenance overhead to consider how to present
> > this configuration knob for any new hashfile or other datafile in the
> > git repo?
> > 4. Is there a good path forward to change the default syncable set,
> > both in git-for-windows and in Git for other platforms?
>
> I'm not really sure this globbing this is a good idea, as noted above
> just a suggestion etc.
>
> As noted there it just gets you out of the business of re-defining
> gitrepository-layout(5), and assuming too much in advance about certain
> use-cases.
>
> E.g. even "refs" might be too broad for some. I don't tend to be I/O
> limited, but I could see how someone who would be would care about
> refs/heads but not refs/remotes, or want to exclude logs/* but not the
> refs updates themselves etc.

This use-case is interesting (distinguishing remote refs from local
refs).  I think the difficulty of verifying (for even an advanced
user) that the right fsyncing is actually happening still puts me on
the side of having a carefully curated and documented set of syncable
things rather than a file-path-based mechanism.

Is this meaningful in the presumably nearby future world of the refsdb
backend?  Is that somehow split by remote versus local?

> >> >> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> >> >> > index 857be7826f3..916c55d6ce9 100644
> >> >> > --- a/builtin/pack-objects.c
> >> >> > +++ b/builtin/pack-objects.c
> >> >> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
> >> >> >                * If so, rewrite it like in fast-import
> >> >> >                */
> >> >> >               if (pack_to_stdout) {
> >> >> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> >> >> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> >> >> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> >> >>
> >> >> Not really related to this per-se, but since you're touching the API
> >> >> everything goes through I wonder if callers should just always try to
> >> >> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
> >> >> tries to flush stdout, or catch the fd at that lower level.
> >> >>
> >> >> Or maybe there's a good reason for this...
> >> >
> >> > It's platform dependent, but I'd expect fsync would do something for
> >> > pipes or stdout redirected to a file.  In these cases we really don't
> >> > want to fsync since we have no idea what we're talking to and we're
> >> > potentially worsening performance for probably no benefit.
> >>
> >> Yeah maybe we should just leave it be.
> >>
> >> I'd think the C library returning EINVAL would be a trivial performance
> >> cost though.
> >>
> >> It just seemed odd to hardcode assumptions about what can and can't be
> >> synced when the POSIX defined function will also tell us that.
> >>
> >
> > Redirecting stdout to a file seems like a common usage for this
> > command. That would definitely be fsyncable, but Git has no idea what
> > its proper category is since there's no way to know the purpose or
> > lifetime of the packfile.  I'm going to leave this be, because I'd
> > posit that "can it be fsynced?" is not the same as "should it be
> > fsynced?".  The latter question can't be answered for stdout.
>
> As noted this was just an aside, and I don't even know if any OS would
> do anything meaningful with an fsync() of such a FD anyway.
>

The underlying fsync primitive does have a meaning on Windows for
pipes, but it's certainly not what Git would want to do. Also if
stdout is redirected to a file, I'm pretty sure that UNIX OSes would
respect the fsync call.  However it's not meaningful in the sense of
the git repository, since we don't know what the packfile is or why it
was created.

> I just don't see why we wouldn't say:
>
>  1. We're syncing this category of thing
>  2. Try it
>  3. If fsync returns "can't fsync that sort of thing" move on
>
> As opposed to trying to shortcut #3 by doing the detection ourselves.
>
> I.e. maybe there was a good reason, but it seemed to be some easy
> potential win for more simplification, since you were re-doing and
> simplifying some of the interface anyway...

We're trying to be deliberate about what we're fsyncing.  Fsyncing an
unknown file created by the packfile code doesn't move us in that
direction.  In your taxonomy we don't know (1), "what is this category
of thing?"  Sure it's got the packfile format, but is not known to be
an actual packfile that's part of the repository.

> >>
> >> >> > [...]
> >> >> > +/*
> >> >> > + * These values are used to help identify parts of a repository to fsync.
> >> >> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
> >> >> > + * repository and so shouldn't be fsynced.
> >> >> > + */
> >> >> > +enum fsync_component {
> >> >> > +     FSYNC_COMPONENT_NONE                    = 0,
> >> >>
> >> >> I haven't read ahead much but in most other such cases we don't define
> >> >> the "= 0", just start at 1<<0, then check the flags elsewhere...
> >> >>
> >> >> > +static const struct fsync_component_entry {
> >> >> > +     const char *name;
> >> >> > +     enum fsync_component component_bits;
> >> >> > +} fsync_component_table[] = {
> >> >> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> >> >> > +     { "pack", FSYNC_COMPONENT_PACK },
> >> >> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> >> >> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> >> >> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
> >> >> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
> >> >> > +     { "all", FSYNC_COMPONENTS_ALL },
> >> >> > +};
> >> >> > +
> >> >> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
> >> >> > +{
> >> >> > +     enum fsync_component output = 0;
> >> >> > +
> >> >> > +     if (!strcmp(string, "none"))
> >> >> > +             return output;
> >> >> > +
> >> >> > +     while (string) {
> >> >> > +             int i;
> >> >> > +             size_t len;
> >> >> > +             const char *ep;
> >> >> > +             int negated = 0;
> >> >> > +             int found = 0;
> >> >> > +
> >> >> > +             string = string + strspn(string, ", \t\n\r");
> >> >>
> >> >> Aside from the "use a list" isn't this hardcoding some windows-specific
> >> >> assumptions with \n\r? Maybe not...
> >> >
> >> > I shamelessly stole this code from parse_whitespace_rule. I thought
> >> > about making a helper to be called by both functions, but the amount
> >> > of state going into and out of the wrapper via arguments was
> >> > substantial and seemed to negate the benefit of deduplication.
> >>
> >> FWIW string_list_split() is easier to work with in those cases, or at
> >> least I think so...
> >
> > This code runs at startup for a variable that may be present on some
> > installations.  The nice property of the current patch's code is that
> > it's already a well-tested pattern that doesn't do any allocations as
> > it's working, unlike string_list_split().
>
> Multi-value config would also get you fewer allocations :)
>
> Anyway, I mainly meant to point out that for stuff like this it's fine
> to optimize it for ease rather than micro-optimize allocations. Those
> really aren't a bottleneck on this scale.
>
> Even in that case there's string_list_split_in_place(), which can be a
> bit nicer than manual C-string fiddling.
>

Am I allowed to change the config value string in place? The
core.whitespace code is careful not to modify the string. I kind of
like the parse_ws_error_highlight code a little better now that I've
seen it, but I think the current code is fine too.

> > I hope you know that I appreciate your review feedback, even though
> > I'm pushing back on most of it so far this round. I'll be sending v3
> > to the list soon after giving it another look over.
>
> Sure, no worries. Just hoping to help. If you go for something different
> etc. that's fine. Just hoping to bridge the gap in some knowledge /
> offer potentially interesting suggestions (some of which may be dead
> ends, like the config glob thing...).

Thanks,
Neeraj

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

Hi Ævar,
Could you please respond to the parent message?
To summarize the main points and questions:
1) The current patch series design of core.fsync favors ease of
modification of the complete configuration as a single atomic config
step, at the cost of making it somewhat harder to derive a new
configuration from an existing one. See [1] where this facility is
used.

2) Is there any existing configuration that uses the multi-value
schema you proposed? The diff.wsErrorHighlight setting is actually
comma-separated.

3) Is there any system you can point at or any support in the POSIX
spec for requiring fsync for anything other than durability of data
across system crash?

4) I think string_list_split_in_place is not valid for splitting a
comma-separated list in the config code, since the value coming from
the configset code is in some global data structure that might be used
again. It seems like we could have subtle problems down the line if we
mutate it.

[1] https://github.com/neerajsi-msft/git/commit/7e9749a7e94d26c88789459588997329c5130792#diff-ee0307657f5a76b723c8973db0dbd5a2ca62e14b02711b897418b35d78fc6023R1327

Thanks,
Neeraj

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Wed, Dec 08 2021, Neeraj Singh wrote:

[Sorry about the late reply, and thanks for the downthread prodding]

> On Wed, Dec 8, 2021 at 8:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>>
>> On Wed, Dec 08 2021, Neeraj Singh wrote:
>>
>> > On Wed, Dec 8, 2021 at 2:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> >>
>> >>
>> >> On Tue, Dec 07 2021, Neeraj Singh wrote:
>> >>
>> >> > On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> >> >>
>> >> >>
>> >> >> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
>> >> >>
>> >> >> > From: Neeraj Singh <neerajsi@microsoft.com>
>> >> >> >
>> >> >> > This commit introduces the `core.fsync` configuration
>> >> >> > knob which can be used to control how components of the
>> >> >> > repository are made durable on disk.
>> >> >> >
>> >> >> > This setting allows future extensibility of the list of
>> >> >> > syncable components:
>> >> >> > * We issue a warning rather than an error for unrecognized
>> >> >> >   components, so new configs can be used with old Git versions.
>> >> >>
>> >> >> Looks good!
>> >> >>
>> >> >> > * We support negation, so users can choose one of the default
>> >> >> >   aggregate options and then remove components that they don't
>> >> >> >   want. The user would then harden any new components added in
>> >> >> >   a Git version update.
>> >> >>
>> >> >> I think this config schema makes sense, but just a (I think important)
>> >> >> comment on the "how" not "what" of it. It's really much better to define
>> >> >> config as:
>> >> >>
>> >> >>     [some]
>> >> >>     key = value
>> >> >>     key = value2
>> >> >>
>> >> >> Than:
>> >> >>
>> >> >>     [some]
>> >> >>     key = value,value2
>> >> >>
>> >> >> The reason is that "git config" has good support for working with
>> >> >> multi-valued stuff, so you can do e.g.:
>> >> >>
>> >> >>     git config --get-all -z some.key
>> >> >>
>> >> >> And you can easily (re)set such config e.g. with --replace-all etc., but
>> >> >> for comma-delimited you (and users) need to do all that work themselves.
>> >> >>
>> >> >> Similarly instead of:
>> >> >>
>> >> >>     some.key = want-this
>> >> >>     some.key = -not-this
>> >> >>     some.key = but-want-this
>> >> >>
>> >> >> I think it's better to just have two lists, one inclusive another
>> >> >> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
>> >> >> "transfer.hideRefs"
>> >> >>
>> >> >> Which would mean:
>> >> >>
>> >> >>     core.fsync = want-this
>> >> >>     core.fsyncExcludes = -not-this
>> >> >>
>> >> >> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
>> >> >> suggestion on making this easier for users & the implementation.
>> >> >>
>> >> >
>> >> > Maybe there's some way to handle this I'm unaware of, but a
>> >> > disadvantage of your multi-valued config proposal is that it's harder,
>> >> > for example, for a per-repo config store to reasonably override a
>> >> > per-user config store.  With the configuration scheme as-is, I can
>> >> > have a per-user setting like `core.fsync=all` which covers my typical
>> >> > repos, but then have a maintainer repo with a private setting of
>> >> > `core.fsync=none` to speed up cases where I'm mostly working with
>> >> > other people's changes that are backed up in email or server-side
>> >> > repos.  The latter setting conveniently overrides the former setting
>> >> > in all aspects.
>> >>
>> >> Even if you turn just your comma-delimited proposal into a list proposal
>> >> can't we just say that the last one wins? Then it won't matter what cmae
>> >> before, you'd specify "core.fsync=none" in your local .git/config.
>> >>
>> >> But this is also a general issue with a bunch of things in git's config
>> >> space. I'd rather see us use the list-based values and just come up with
>> >> some general way to reset them that works for all keys, rather than
>> >> regretting having comma-delimited values that'll be harder to work with
>> >> & parse, which will be a legacy wart if/when we come up with a way to
>> >> say "reset all previous settings".
>> >>
>> >> > Also, with the core.fsync and core.fsyncExcludes, how would you spell
>> >> > "don't sync anything"? Would you still have the aggregate options.?
>> >>
>> >>     core.fsyncExcludes = *
>> >>
>> >> I.e. the same as the core.fsync=none above, anyway I still like the
>> >> wildcard thing below a bit more...
>> >
>> > I'm not going to take this feedback unless there are additional votes
>> > from the Git community in this direction.  I make the claim that
>> > single-valued comma-separated config lists are easier to work with in
>> > the existing Git infrastructure.
>>
>> Easier in what sense? I showed examples of how "git-config" trivially
>> works with multi-valued config, but for comma-delimited you'll need to
>> write your own shellscript boilerplate around simple things like adding
>> values, removing existing ones etc.
>>
>> I.e. you can use --add, --unset, --unset-all, --get, --get-all etc.
>>
>
> I see what you're saying for cases where someone would want to set a
> core.fsync setting that's derived from the user's current config.  But
> I'm guessing that the dominant use case is someone setting a new fsync
> configuration that achieves some atomic goal with respect to a given
> repository. Like "this is a throwaway, so sync nothing" or "this is
> really important, so sync all objects and refs and the index".

Whether it's multi-value or comma-separated you could do:

    -c core.fsync=[none|false]

To sync nothing, i.e. if we see "none/false" it doesn't matter if we saw
core.fsync=loose-object or whatever before, it would override it, ditto
for "all".

>> > parsing code for the core.whitespace variable and users are used to
>> > this syntax there. There are several other comma-separated lists in
>> > the config space, so this construct has precedence and will be with
>> > Git for some time.
>>
>> That's not really an argument either way for why we'd pick X over Y for
>> something new. We've got some comma-delimited, some multi-value (I'm
>> fairly sure we have more multi-value).
>>
>
> My main point here is that there's precedent for patch's current exact
> schema for a config in the same core config leaf of the Documentation.
> It seems from your comments that we'd have to invent and document some
> new convention for "reset" of a multi-valued config.  So you're
> suggesting that I solve an extra set of problems to get this change
> in.  Just want to remind you that my personal itch to scratch is to
> get the underlying mechanism in so that git-for-windows can set its
> default setting to batch mode. I'm not expecting many users to
> actually configure this setting to any non-default value.

Me neither. I think people will most likely set this once in
~/.gitconfig or /etc/gitconfig.

We have some config keys that are multi-value and either comma-separated
or space-separated, e.g. core.alternateRefsPrefixes

Then we have e.g. blame.ignoreRevsFile which is multi-value, and further
has the convention that setting it to an empty value clears the
list. which would scratch the "override existing" itch.

format.notes, versionsort.suffix, transfer.hideRefs, branch.<name>.merge
are exmples of existing multi-value config.

>> > Also, fsync configurations aren't composable like
>> > some other configurations may be. It makes sense to have a holistic
>> > singular fsync configuration, which is best represented by a single
>> > variable.
>>
>> What's a "variable" here? We call these "keys", you can have a
>> single-value key like user.name that you get with --get, or a
>> multi-value key like say branch.<name>.merge or push.pushOption that
>> you'd get with --get-all.
>
> Yeah, I meant "key".  I conflated the config key with the underlying
> global variable in git.

*nod*

>> I think you may be referring to either not wanting these to be
>> "inherited" (which is not really a think we do for anything else in
>> config), or lacking the ability to "reset".
>>
>> For the latter if that's absolutely needed we could just use the same
>> trick as "diff.wsErrorHighlight" uses of making an empty value "reset"
>> the list, and you'd get better "git config" support for editing it.
>>
>
> My reading of the code is that diff.wsErrorHighlight is a comma
> separated list and not a multi-valued config.  Actually I haven't yet
> found an existing multi-valued config (not sure how to grep for it).

Yes, I think I conflated it with one of the ones above when I wrote
this.

>> >> >> > This also supports the common request of doing absolutely no
>> >> >> > fysncing with the `core.fsync=none` value, which is expected
>> >> >> > to make the test suite faster.
>> >> >>
>> >> >> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
>> >> >> so we'll accept "false", "off", "no" like most other such config?
>> >> >
>> >> > Junio's previous feedback when discussing batch mode [1] was to offer
>> >> > less flexibility when parsing new values of these configuration
>> >> > options. I agree with his statement that "making machine-readable
>> >> > tokens be spelled in different ways is a 'disease'."  I'd like to
>> >> > leave this as-is so that the documentation can clearly state the exact
>> >> > set of allowable values.
>> >> >
>> >> > [1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/
>> >>
>> >> I think he's talking about batch, Batch, BATCH, bAtCh etc. there. But
>> >> the "maybe bool" is a stanard pattern we use.
>> >>
>> >> I don't think we'd call one of these 0, off, no or false etc. to avoid
>> >> confusion, so then you can use git_parse_maybe_...()
>> >
>> > I don't see the advantage of having multiple ways of specifying
>> > "none".  The user can read the doc and know exactly what to write.  If
>> > they write something unallowable, they get a clear warning and they
>> > can read the doc again to figure out what to write.  This isn't a
>> > boolean options at all, so why should we entertain bool-like ways of
>> > spelling it?
>>
>> It's not boolean, it's multi-value and one of the values includes a true
>> or false boolean value. You just spell it "none".
>>
>> I think both this and your comment above suggest that you think there's
>> no point in this because you haven't interacted with/used "git config"
>> as a command line or API mechanism, but have just hand-crafted config
>> files.
>>
>> That's fair enough, but there's a lot of tooling that benefits from the
>> latter.
>
> My batch mode perf tests (on github, not yet submitted to the list)
> use `git -c core.fsync=<foo>` to set up a per-process config.  I
> haven't used the `git config` writing support in a while, so I haven't
> deeply thought about that.  However, I favor simplifying the use case
> of "atomically" setting a new holistic core.fsync value versus the use
> case of deriving a new core.fsync value from the preexisting value.

If you implement it like blame.ignoreRevsFile you can have your cake and
eat it too, i.e.:

    -c core.fsync= core.fsync=loose-object

ensures only loose objects are synced, as with your single-value config,
but I'd think what you'd be more likely to actually mean would be:

    # With "core.fsync=pack" set in ~/.gitconfig
    -c core.fsync=loose-object

I.e. that the common case is "I want this to be synced here", not that
you'd like to declare sync policy from scratch every time.

In any case, on this general topic my main point is that the
git-config(1) command has pretty integration for multi-value if you do
it that way, and not for comma-delimited. I.e. you get --add, --unset,
--unset-all, --get, --get-all etc.

So I think for anything new it makes sense to lean into that, I think
most of these existing comma-delimited ones are ones we'd do differently
today on reflection.

And if you suppor the "empty resets" like blame.ignoreRevsFile it seems
to me you'll have your cake & eat it too.

>> E.g.:
>>
>>     $ git -c core.fsync=off config --type=bool core.fsync
>>     false
>>     $ git -c core.fsync=blah config --type=bool core.fsync
>>     fatal: bad boolean config value 'blah' for 'core.fsync'
>>
>> Here we can get 'git config' to normalize what you call 'none', and you
>> can tell via exit codes/normalization if it's "false". But if you invent
>> a new term for "false" you can't do that as easily.
>>
>> We have various historical keys that take odd exceptions to that,
>> e.g. "never", but unless we have a good reason to let's not invent more
>> exceptions.
>>
>> >> >> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
>> >> >> > ---
>> >> >> >  Documentation/config/core.txt | 27 +++++++++----
>> >> >> >  builtin/fast-import.c         |  2 +-
>> >> >> >  builtin/index-pack.c          |  4 +-
>> >> >> >  builtin/pack-objects.c        |  8 ++--
>> >> >> >  bulk-checkin.c                |  5 ++-
>> >> >> >  cache.h                       | 39 +++++++++++++++++-
>> >> >> >  commit-graph.c                |  3 +-
>> >> >> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
>> >> >> >  csum-file.c                   |  5 ++-
>> >> >> >  csum-file.h                   |  3 +-
>> >> >> >  environment.c                 |  1 +
>> >> >> >  midx.c                        |  3 +-
>> >> >> >  object-file.c                 |  3 +-
>> >> >> >  pack-bitmap-write.c           |  3 +-
>> >> >> >  pack-write.c                  | 13 +++---
>> >> >> >  read-cache.c                  |  2 +-
>> >> >> >  16 files changed, 164 insertions(+), 33 deletions(-)
>> >> >> >
>> >> >> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
>> >> >> > index dbb134f7136..4f1747ec871 100644
>> >> >> > --- a/Documentation/config/core.txt
>> >> >> > +++ b/Documentation/config/core.txt
>> >> >> > @@ -547,6 +547,25 @@ core.whitespace::
>> >> >> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
>> >> >> >    errors. The default tab width is 8. Allowed values are 1 to 63.
>> >> >> >
>> >> >> > +core.fsync::
>> >> >> > +     A comma-separated list of parts of the repository which should be
>> >> >> > +     hardened via the core.fsyncMethod when created or modified. You can
>> >> >> > +     disable hardening of any component by prefixing it with a '-'. Later
>> >> >> > +     items take precedence over earlier ones in the list. For example,
>> >> >> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
>> >> >> > +     metadata." Items that are not hardened may be lost in the event of an
>> >> >> > +     unclean system shutdown.
>> >> >> > ++
>> >> >> > +* `none` disables fsync completely. This must be specified alone.
>> >> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
>> >> >> > +* `pack` hardens objects added to the repo in packfile form.
>> >> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
>> >> >> > +* `commit-graph` hardens the commit graph file.
>> >> >> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
>> >> >> > +  `pack-metadata`, and `commit-graph`.
>> >> >> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
>> >> >> > +* `all` is an aggregate option that syncs all individual components above.
>> >> >> > +
>> >> >>
>> >> >> It's probably a *bit* more work to set up, but I wonder if this wouldn't
>> >> >> be simpler if we just said (and this is partially going against what I
>> >> >> noted above):
>> >> >>
>> >> >> == BEGIN DOC
>> >> >>
>> >> >> core.fsync is a multi-value config variable where each item is a
>> >> >> pathspec that'll get matched the same way as 'git-ls-files' et al.
>> >> >>
>> >> >> When we sync pretend that a path like .git/objects/de/adbeef... is
>> >> >> relative to the top-level of the git
>> >> >> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
>> >> >>
>> >> >> You can then supply a list of wildcards and exclusions to configure
>> >> >> syncing.  or "false", "off" etc. to turn it off. These are synonymous
>> >> >> with:
>> >> >>
>> >> >>     ; same as "false"
>> >> >>     core.fsync = ":!*"
>> >> >>
>> >> >> Or:
>> >> >>
>> >> >>     ; same as "true"
>> >> >>     core.fsync = "*"
>> >> >>
>> >> >> Or, to selectively sync some things and not others:
>> >> >>
>> >> >>     ;; Sync objects, but not "info"
>> >> >>     core.fsync = ":!objects/info/**"
>> >> >>     core.fsync = "objects/**"
>> >> >>
>> >> >> See gitrepository-layout(5) for details about what sort of paths you
>> >> >> might be expected to match. Not all paths listed there will go through
>> >> >> this mechanism (e.g. currently objects do, but nothing to do with config
>> >> >> does).
>> >> >>
>> >> >> We can and will match this against "fake paths", e.g. when writing out
>> >> >> packs we may match against just the string "objects/pack", we're not
>> >> >> going to re-check if every packfile we're writing matches your globs,
>> >> >> ditto for loose objects. Be reasonable!
>> >> >>
>> >> >> This metharism is intended as a shorthand that provides some flexibility
>> >> >> when fsyncing, while not forcing git to come up with labels for all
>> >> >> paths the git dir, or to support crazyness like "objects/de/adbeef*"
>> >> >>
>> >> >> More paths may be added or removed in the future, and we make no
>> >> >> promises that we won't move things around, so if in doubt use
>> >> >> e.g. "true" or a wide pattern match like "objects/**". When in doubt
>> >> >> stick to the golden path of examples provided in this documentation.
>> >> >>
>> >> >> == END DOC
>> >> >>
>> >> >>
>> >> >> It's a tad more complex to set up, but I wonder if that isn't worth
>> >> >> it. It nicely gets around any current and future issues of deciding what
>> >> >> labels such as "loose-object" etc. to pick, as well as slotting into an
>> >> >> existing method of doing exclude/include lists.
>> >> >>
>> >> >
>> >> > I think this proposal is a lot of complexity to avoid coming up with a
>> >> > new name for syncable things as they are added to Git.  A path based
>> >> > mechanism makes it hard to document for the (advanced) user what the
>> >> > full set of things is and how it might change from release to release.
>> >> > I think the current core.fsync scheme is a bit easier to understand,
>> >> > query, and extend.
>> >>
>> >> We document it in gitrepository-layout(5). Yeah it has some
>> >> disadvantages, but one advantage is that you could make the
>> >> composability easy. I.e. if last exclude wins then a setting of:
>> >>
>> >>     core.fsync = ":!*"
>> >>     core.fsync = "objects/**"
>> >>
>> >> Would reset all previous matches & only match objects/**.
>> >>
>> >
>> > The value of changing this is predicated on taking your previous
>> > multi-valued config proposal, which I'm still not at all convinced
>> > about.
>>
>> They're orthagonal. I.e. you get benefits from multi-value with or
>> without this globbing mechanism.
>>
>> In any case, I don't feel strongly about/am really advocating this
>> globbing mechanism. I just wondered if it wouldn't make things simpler
>> since it would sidestep the need to create any sort of categories for
>> subsets of gitrepository-layout(5), but maybe not...
>>
>> > The schema in the current (v1-v2) version of the patch already
>> > includes an example of extending the list of syncable things, and
>> > Patrick Steinhardt made it clear that he feels comfortable adding
>> > 'refs' to the same schema in a future change.
>> >
>> > I'll also emphasize that we're talking about a non-functional,
>> > relatively corner-case behavioral configuration.  These values don't
>> > change how git's interface behaves except when the system crashes
>> > during a git command or shortly after one completes.
>>
>> That may be something some OS's promise, but it's not something fsync()
>> or POSIX promises. I.e. you might write a ref, but unless you fsync and
>> the relevant dir entries another process might not see the update, crash
>> or not.
>>
>
> I haven't seen any indication that POSIX requires an fsync for
> visiblity within a running system.  I looked at the doc for open,
> write, and fsync, and saw no indication that it's posix compliant to
> require an fsync for visibility.  I think an OS that required fsync
> for cross-process visiblity would fail to run Git for a myriad of
> other reasons and would likely lose all its users.  I'm curious where
> you've seen documentation that allows such unhelpful behavior?

There's multiple unrelated and related things in this area. One is a
case where you'll e.g. write a file "foo" using stdio, spawn a program
to work on it in the same program, but it might not see it at all, or
see empty content, the latter being because you haven't flushed your I/O
buffers (which you can do via fsync()).

The former is that on *nix systems you're generally only guaranteed to
write to a fd, but not to have the associated metadata be synced for
you.

That is spelled out e.g. in the DESCRIPTION section of linux's fsync()
manpage: https://man7.org/linux/man-pages/man2/fdatasync.2.html

I don't know how much you follow non-Windows FS development, but there
was also a very well known "incident" early in ext4 where it leaned into
some permissive-by-POSIX behavior that caused data loss in practice on
buggy programs that didn't correctly use fsync() , since various tooling
had come to expect the stricter behavior of ext3:
https://lwn.net/Articles/328363/

That was explicitly in the area of fs metadata being discussed here.

Generally you can expect your VFS layer to be forgiving when it comes to
IPC, but even that is out the window when it comes to networked
filesystems, e.g. a shared git repository hosted on NFS.

>> That's an aside from these config design questions, and I think
>> most/(all?) OS's anyone cares about these days tend to make that
>> implicit promise as part of their VFS behavior, but we should probably
>> design such an interface to fsync() with such pedantic portability in
>> mind.
>
> Why? To be rude to such a hypothetical system, if a system were so
> insanely designed, it would be nuts to support it.

Because we know that right now the system calls we're invoking aren't
guaranteed to store data persistently to disk portably, although they do
so in practice on most modern OS's.

We're portably to a lot of platforms, and also need to keep e.g. NFS in
mind, so being able to ask for a pedantic mode when you care about data
retention at the cost of performance would be nice.

And because the fsync config mode you're proposing is thoroughly
non-standard, but is known to me much faster by leaning into known
attributes of specific FS's on specific OS's, if we're not running on
those it would be sensible to fall back to a stricter mode of
operation. E.g. syncing all 100 loose objects we just wrote, not just
the last one.

>> > While you may not personally love the proposed configuration
>> > interface, I'd want your view on some questions:
>> > 1. Is it easy for the (advanced) user to set a configuration?
>> > 2. Is it easy for the (advanced) user to see what was configured?
>> > 3. Is it easy for the Git community to build on this as we want to add
>> > things to the list of things to sync?
>> >     a) Is there a good best practice configuration so that people can
>> > avoid losing integrity for new stuff that they are intending to sync.
>> >     b) If someone has a custom configuration, can that custom
>> > configuration do something reasonable as they upgrade versions of Git?
>> >              ** In response to this question, I might see some value
>> > in adding a 'derived-metadata' aggregate that can be disabled so that
>> > a custom configuration can exclude those as they change version to
>> > version.
>> >     c) Is it too much maintenance overhead to consider how to present
>> > this configuration knob for any new hashfile or other datafile in the
>> > git repo?
>> > 4. Is there a good path forward to change the default syncable set,
>> > both in git-for-windows and in Git for other platforms?
>>
>> I'm not really sure this globbing this is a good idea, as noted above
>> just a suggestion etc.
>>
>> As noted there it just gets you out of the business of re-defining
>> gitrepository-layout(5), and assuming too much in advance about certain
>> use-cases.
>>
>> E.g. even "refs" might be too broad for some. I don't tend to be I/O
>> limited, but I could see how someone who would be would care about
>> refs/heads but not refs/remotes, or want to exclude logs/* but not the
>> refs updates themselves etc.
>
> This use-case is interesting (distinguishing remote refs from local
> refs).  I think the difficulty of verifying (for even an advanced
> user) that the right fsyncing is actually happening still puts me on
> the side of having a carefully curated and documented set of syncable
> things rather than a file-path-based mechanism.
>
> Is this meaningful in the presumably nearby future world of the refsdb
> backend?  Is that somehow split by remote versus local?

There is the upcoming "reftable" work, but that's probably 2-3 years out
at the earliest for series production workloads in git.git.

>> >> >> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> >> >> > index 857be7826f3..916c55d6ce9 100644
>> >> >> > --- a/builtin/pack-objects.c
>> >> >> > +++ b/builtin/pack-objects.c
>> >> >> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
>> >> >> >                * If so, rewrite it like in fast-import
>> >> >> >                */
>> >> >> >               if (pack_to_stdout) {
>> >> >> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>> >> >> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
>> >> >> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
>> >> >>
>> >> >> Not really related to this per-se, but since you're touching the API
>> >> >> everything goes through I wonder if callers should just always try to
>> >> >> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
>> >> >> tries to flush stdout, or catch the fd at that lower level.
>> >> >>
>> >> >> Or maybe there's a good reason for this...
>> >> >
>> >> > It's platform dependent, but I'd expect fsync would do something for
>> >> > pipes or stdout redirected to a file.  In these cases we really don't
>> >> > want to fsync since we have no idea what we're talking to and we're
>> >> > potentially worsening performance for probably no benefit.
>> >>
>> >> Yeah maybe we should just leave it be.
>> >>
>> >> I'd think the C library returning EINVAL would be a trivial performance
>> >> cost though.
>> >>
>> >> It just seemed odd to hardcode assumptions about what can and can't be
>> >> synced when the POSIX defined function will also tell us that.
>> >>
>> >
>> > Redirecting stdout to a file seems like a common usage for this
>> > command. That would definitely be fsyncable, but Git has no idea what
>> > its proper category is since there's no way to know the purpose or
>> > lifetime of the packfile.  I'm going to leave this be, because I'd
>> > posit that "can it be fsynced?" is not the same as "should it be
>> > fsynced?".  The latter question can't be answered for stdout.
>>
>> As noted this was just an aside, and I don't even know if any OS would
>> do anything meaningful with an fsync() of such a FD anyway.
>>
>
> The underlying fsync primitive does have a meaning on Windows for
> pipes, but it's certainly not what Git would want to do. Also if
> stdout is redirected to a file, I'm pretty sure that UNIX OSes would
> respect the fsync call.  However it's not meaningful in the sense of
> the git repository, since we don't know what the packfile is or why it
> was created.

I suggested that because I think it's probably nonsensical, but it's
nonsense that POSIX seems to explicitly tell us that it'll handle
(probably by silently doing nothing). So in terms of our interface we
could lean into that and avoid our own special-casing.

>> I just don't see why we wouldn't say:
>>
>>  1. We're syncing this category of thing
>>  2. Try it
>>  3. If fsync returns "can't fsync that sort of thing" move on
>>
>> As opposed to trying to shortcut #3 by doing the detection ourselves.
>>
>> I.e. maybe there was a good reason, but it seemed to be some easy
>> potential win for more simplification, since you were re-doing and
>> simplifying some of the interface anyway...
>
> We're trying to be deliberate about what we're fsyncing.  Fsyncing an
> unknown file created by the packfile code doesn't move us in that
> direction.  In your taxonomy we don't know (1), "what is this category
> of thing?"  Sure it's got the packfile format, but is not known to be
> an actual packfile that's part of the repository.

We know it's a fd, isn't that sufficient? In any case, I'm fine with
also keeping it as is, I don't mean to split hairs here.

It just stuck out as an odd part of the interface, why treat some fd's
specially, instead of just throwing it all at the OS. Presumably the
first thing the OS will do is figure out if it's a syncable fd or not,
and act appropriately.

>> >>
>> >> >> > [...]
>> >> >> > +/*
>> >> >> > + * These values are used to help identify parts of a repository to fsync.
>> >> >> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
>> >> >> > + * repository and so shouldn't be fsynced.
>> >> >> > + */
>> >> >> > +enum fsync_component {
>> >> >> > +     FSYNC_COMPONENT_NONE                    = 0,
>> >> >>
>> >> >> I haven't read ahead much but in most other such cases we don't define
>> >> >> the "= 0", just start at 1<<0, then check the flags elsewhere...
>> >> >>
>> >> >> > +static const struct fsync_component_entry {
>> >> >> > +     const char *name;
>> >> >> > +     enum fsync_component component_bits;
>> >> >> > +} fsync_component_table[] = {
>> >> >> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
>> >> >> > +     { "pack", FSYNC_COMPONENT_PACK },
>> >> >> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
>> >> >> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
>> >> >> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
>> >> >> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
>> >> >> > +     { "all", FSYNC_COMPONENTS_ALL },
>> >> >> > +};
>> >> >> > +
>> >> >> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
>> >> >> > +{
>> >> >> > +     enum fsync_component output = 0;
>> >> >> > +
>> >> >> > +     if (!strcmp(string, "none"))
>> >> >> > +             return output;
>> >> >> > +
>> >> >> > +     while (string) {
>> >> >> > +             int i;
>> >> >> > +             size_t len;
>> >> >> > +             const char *ep;
>> >> >> > +             int negated = 0;
>> >> >> > +             int found = 0;
>> >> >> > +
>> >> >> > +             string = string + strspn(string, ", \t\n\r");
>> >> >>
>> >> >> Aside from the "use a list" isn't this hardcoding some windows-specific
>> >> >> assumptions with \n\r? Maybe not...
>> >> >
>> >> > I shamelessly stole this code from parse_whitespace_rule. I thought
>> >> > about making a helper to be called by both functions, but the amount
>> >> > of state going into and out of the wrapper via arguments was
>> >> > substantial and seemed to negate the benefit of deduplication.
>> >>
>> >> FWIW string_list_split() is easier to work with in those cases, or at
>> >> least I think so...
>> >
>> > This code runs at startup for a variable that may be present on some
>> > installations.  The nice property of the current patch's code is that
>> > it's already a well-tested pattern that doesn't do any allocations as
>> > it's working, unlike string_list_split().
>>
>> Multi-value config would also get you fewer allocations :)
>>
>> Anyway, I mainly meant to point out that for stuff like this it's fine
>> to optimize it for ease rather than micro-optimize allocations. Those
>> really aren't a bottleneck on this scale.
>>
>> Even in that case there's string_list_split_in_place(), which can be a
>> bit nicer than manual C-string fiddling.
>>
>
> Am I allowed to change the config value string in place? The
> core.whitespace code is careful not to modify the string. I kind of
> like the parse_ws_error_highlight code a little better now that I've
> seen it, but I think the current code is fine too.

I don't remember offhand if that's safe, probably not. So you'll need a
copy here.

>> > I hope you know that I appreciate your review feedback, even though
>> > I'm pushing back on most of it so far this round. I'll be sending v3
>> > to the list soon after giving it another look over.
>>
>> Sure, no worries. Just hoping to help. If you go for something different
>> etc. that's fine. Just hoping to bridge the gap in some knowledge /
>> offer potentially interesting suggestions (some of which may be dead
>> ends, like the config glob thing...).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Jan 18 2022, Neeraj Singh wrote:

> Hi Ævar,
> Could you please respond to the parent message?

I did just now in
https://lore.kernel.org/git/220119.8635ljoidt.gmgdl@evledraar.gmail.com/;
sorry about the delay. This thread fell off my radar.

> To summarize the main points and questions:
> 1) The current patch series design of core.fsync favors ease of
> modification of the complete configuration as a single atomic config
> step, at the cost of making it somewhat harder to derive a new
> configuration from an existing one. See [1] where this facility is
> used.
>
> 2) Is there any existing configuration that uses the multi-value
> schema you proposed? The diff.wsErrorHighlight setting is actually
> comma-separated.

I replied to that. To add a bit to that the comma-delimited thing isn't
any sort of a "blocker" or whatever in my mind.

I just wanted to point out that you could get the same with multi-value
with some better integration, and we might have our cake & eat it too.

But at the end of the day if you disagree you're doing the work, and
that's ultimately bikeshedding of the config interface. I'm fine with it
either way.

> 3) Is there any system you can point at or any support in the POSIX
> spec for requiring fsync for anything other than durability of data
> across system crash?

Some examples in the linked...

> 4) I think string_list_split_in_place is not valid for splitting a
> comma-separated list in the config code, since the value coming from
> the configset code is in some global data structure that might be used
> again. It seems like we could have subtle problems down the line if we
> mutate it.

Replied to that too, hope it's all useful. Thanks again for working on
this, it's really nice to have someone move the state of data integrity
in git forward.

> [1] https://github.com/neerajsi-msft/git/commit/7e9749a7e94d26c88789459588997329c5130792#diff-ee0307657f5a76b723c8973db0dbd5a2ca62e14b02711b897418b35d78fc6023R1327

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Wed, Jan 19, 2022 at 10:27 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Wed, Dec 08 2021, Neeraj Singh wrote:
>
> [Sorry about the late reply, and thanks for the downthread prodding]

I also apologize for the late reply here.  I've been dealing with a
hospitalized parent this week.

>
> > On Wed, Dec 8, 2021 at 8:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >>
> >>
> >> On Wed, Dec 08 2021, Neeraj Singh wrote:
> >>
> >> > On Wed, Dec 8, 2021 at 2:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >> >>
> >> >>
> >> >> On Tue, Dec 07 2021, Neeraj Singh wrote:
> >> >>
> >> >> > On Tue, Dec 7, 2021 at 5:01 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Dec 07 2021, Neeraj Singh via GitGitGadget wrote:
> >> >> >>
> >> >> >> > From: Neeraj Singh <neerajsi@microsoft.com>
> >> >> >> >
> >> >> >> > This commit introduces the `core.fsync` configuration
> >> >> >> > knob which can be used to control how components of the
> >> >> >> > repository are made durable on disk.
> >> >> >> >
> >> >> >> > This setting allows future extensibility of the list of
> >> >> >> > syncable components:
> >> >> >> > * We issue a warning rather than an error for unrecognized
> >> >> >> >   components, so new configs can be used with old Git versions.
> >> >> >>
> >> >> >> Looks good!
> >> >> >>
> >> >> >> > * We support negation, so users can choose one of the default
> >> >> >> >   aggregate options and then remove components that they don't
> >> >> >> >   want. The user would then harden any new components added in
> >> >> >> >   a Git version update.
> >> >> >>
> >> >> >> I think this config schema makes sense, but just a (I think important)
> >> >> >> comment on the "how" not "what" of it. It's really much better to define
> >> >> >> config as:
> >> >> >>
> >> >> >>     [some]
> >> >> >>     key = value
> >> >> >>     key = value2
> >> >> >>
> >> >> >> Than:
> >> >> >>
> >> >> >>     [some]
> >> >> >>     key = value,value2
> >> >> >>
> >> >> >> The reason is that "git config" has good support for working with
> >> >> >> multi-valued stuff, so you can do e.g.:
> >> >> >>
> >> >> >>     git config --get-all -z some.key
> >> >> >>
> >> >> >> And you can easily (re)set such config e.g. with --replace-all etc., but
> >> >> >> for comma-delimited you (and users) need to do all that work themselves.
> >> >> >>
> >> >> >> Similarly instead of:
> >> >> >>
> >> >> >>     some.key = want-this
> >> >> >>     some.key = -not-this
> >> >> >>     some.key = but-want-this
> >> >> >>
> >> >> >> I think it's better to just have two lists, one inclusive another
> >> >> >> exclusive. E.g. see "log.decorate" and "log.excludeDecoration",
> >> >> >> "transfer.hideRefs"
> >> >> >>
> >> >> >> Which would mean:
> >> >> >>
> >> >> >>     core.fsync = want-this
> >> >> >>     core.fsyncExcludes = -not-this
> >> >> >>
> >> >> >> For some value of "fsyncExcludes", maybe "noFsync"? Anyway, just a
> >> >> >> suggestion on making this easier for users & the implementation.
> >> >> >>
> >> >> >
> >> >> > Maybe there's some way to handle this I'm unaware of, but a
> >> >> > disadvantage of your multi-valued config proposal is that it's harder,
> >> >> > for example, for a per-repo config store to reasonably override a
> >> >> > per-user config store.  With the configuration scheme as-is, I can
> >> >> > have a per-user setting like `core.fsync=all` which covers my typical
> >> >> > repos, but then have a maintainer repo with a private setting of
> >> >> > `core.fsync=none` to speed up cases where I'm mostly working with
> >> >> > other people's changes that are backed up in email or server-side
> >> >> > repos.  The latter setting conveniently overrides the former setting
> >> >> > in all aspects.
> >> >>
> >> >> Even if you turn just your comma-delimited proposal into a list proposal
> >> >> can't we just say that the last one wins? Then it won't matter what cmae
> >> >> before, you'd specify "core.fsync=none" in your local .git/config.
> >> >>
> >> >> But this is also a general issue with a bunch of things in git's config
> >> >> space. I'd rather see us use the list-based values and just come up with
> >> >> some general way to reset them that works for all keys, rather than
> >> >> regretting having comma-delimited values that'll be harder to work with
> >> >> & parse, which will be a legacy wart if/when we come up with a way to
> >> >> say "reset all previous settings".
> >> >>
> >> >> > Also, with the core.fsync and core.fsyncExcludes, how would you spell
> >> >> > "don't sync anything"? Would you still have the aggregate options.?
> >> >>
> >> >>     core.fsyncExcludes = *
> >> >>
> >> >> I.e. the same as the core.fsync=none above, anyway I still like the
> >> >> wildcard thing below a bit more...
> >> >
> >> > I'm not going to take this feedback unless there are additional votes
> >> > from the Git community in this direction.  I make the claim that
> >> > single-valued comma-separated config lists are easier to work with in
> >> > the existing Git infrastructure.
> >>
> >> Easier in what sense? I showed examples of how "git-config" trivially
> >> works with multi-valued config, but for comma-delimited you'll need to
> >> write your own shellscript boilerplate around simple things like adding
> >> values, removing existing ones etc.
> >>
> >> I.e. you can use --add, --unset, --unset-all, --get, --get-all etc.
> >>
> >
> > I see what you're saying for cases where someone would want to set a
> > core.fsync setting that's derived from the user's current config.  But
> > I'm guessing that the dominant use case is someone setting a new fsync
> > configuration that achieves some atomic goal with respect to a given
> > repository. Like "this is a throwaway, so sync nothing" or "this is
> > really important, so sync all objects and refs and the index".
>
> Whether it's multi-value or comma-separated you could do:
>
>     -c core.fsync=[none|false]
>
> To sync nothing, i.e. if we see "none/false" it doesn't matter if we saw
> core.fsync=loose-object or whatever before, it would override it, ditto
> for "all".
>
> >> > parsing code for the core.whitespace variable and users are used to
> >> > this syntax there. There are several other comma-separated lists in
> >> > the config space, so this construct has precedence and will be with
> >> > Git for some time.
> >>
> >> That's not really an argument either way for why we'd pick X over Y for
> >> something new. We've got some comma-delimited, some multi-value (I'm
> >> fairly sure we have more multi-value).
> >>
> >
> > My main point here is that there's precedent for patch's current exact
> > schema for a config in the same core config leaf of the Documentation.
> > It seems from your comments that we'd have to invent and document some
> > new convention for "reset" of a multi-valued config.  So you're
> > suggesting that I solve an extra set of problems to get this change
> > in.  Just want to remind you that my personal itch to scratch is to
> > get the underlying mechanism in so that git-for-windows can set its
> > default setting to batch mode. I'm not expecting many users to
> > actually configure this setting to any non-default value.
>
> Me neither. I think people will most likely set this once in
> ~/.gitconfig or /etc/gitconfig.
>
> We have some config keys that are multi-value and either comma-separated
> or space-separated, e.g. core.alternateRefsPrefixes
>
> Then we have e.g. blame.ignoreRevsFile which is multi-value, and further
> has the convention that setting it to an empty value clears the
> list. which would scratch the "override existing" itch.
>
> format.notes, versionsort.suffix, transfer.hideRefs, branch.<name>.merge
> are exmples of existing multi-value config.
>

Thanks for the examples.  I can see the benefit of mutli-value for
most of those settings, but for versionsort.suffix, I'd personally
have wanted a comma-separated list.

> >> > Also, fsync configurations aren't composable like
> >> > some other configurations may be. It makes sense to have a holistic
> >> > singular fsync configuration, which is best represented by a single
> >> > variable.
> >>
> >> What's a "variable" here? We call these "keys", you can have a
> >> single-value key like user.name that you get with --get, or a
> >> multi-value key like say branch.<name>.merge or push.pushOption that
> >> you'd get with --get-all.
> >
> > Yeah, I meant "key".  I conflated the config key with the underlying
> > global variable in git.
>
> *nod*
>
> >> I think you may be referring to either not wanting these to be
> >> "inherited" (which is not really a think we do for anything else in
> >> config), or lacking the ability to "reset".
> >>
> >> For the latter if that's absolutely needed we could just use the same
> >> trick as "diff.wsErrorHighlight" uses of making an empty value "reset"
> >> the list, and you'd get better "git config" support for editing it.
> >>
> >
> > My reading of the code is that diff.wsErrorHighlight is a comma
> > separated list and not a multi-valued config.  Actually I haven't yet
> > found an existing multi-valued config (not sure how to grep for it).
>
> Yes, I think I conflated it with one of the ones above when I wrote
> this.
>
> >> >> >> > This also supports the common request of doing absolutely no
> >> >> >> > fysncing with the `core.fsync=none` value, which is expected
> >> >> >> > to make the test suite faster.
> >> >> >>
> >> >> >> Let's just use the git_parse_maybe_bool() or git_parse_maybe_bool_text()
> >> >> >> so we'll accept "false", "off", "no" like most other such config?
> >> >> >
> >> >> > Junio's previous feedback when discussing batch mode [1] was to offer
> >> >> > less flexibility when parsing new values of these configuration
> >> >> > options. I agree with his statement that "making machine-readable
> >> >> > tokens be spelled in different ways is a 'disease'."  I'd like to
> >> >> > leave this as-is so that the documentation can clearly state the exact
> >> >> > set of allowable values.
> >> >> >
> >> >> > [1] https://lore.kernel.org/git/xmqqr1dqzyl7.fsf@gitster.g/
> >> >>
> >> >> I think he's talking about batch, Batch, BATCH, bAtCh etc. there. But
> >> >> the "maybe bool" is a stanard pattern we use.
> >> >>
> >> >> I don't think we'd call one of these 0, off, no or false etc. to avoid
> >> >> confusion, so then you can use git_parse_maybe_...()
> >> >
> >> > I don't see the advantage of having multiple ways of specifying
> >> > "none".  The user can read the doc and know exactly what to write.  If
> >> > they write something unallowable, they get a clear warning and they
> >> > can read the doc again to figure out what to write.  This isn't a
> >> > boolean options at all, so why should we entertain bool-like ways of
> >> > spelling it?
> >>
> >> It's not boolean, it's multi-value and one of the values includes a true
> >> or false boolean value. You just spell it "none".
> >>
> >> I think both this and your comment above suggest that you think there's
> >> no point in this because you haven't interacted with/used "git config"
> >> as a command line or API mechanism, but have just hand-crafted config
> >> files.
> >>
> >> That's fair enough, but there's a lot of tooling that benefits from the
> >> latter.
> >
> > My batch mode perf tests (on github, not yet submitted to the list)
> > use `git -c core.fsync=<foo>` to set up a per-process config.  I
> > haven't used the `git config` writing support in a while, so I haven't
> > deeply thought about that.  However, I favor simplifying the use case
> > of "atomically" setting a new holistic core.fsync value versus the use
> > case of deriving a new core.fsync value from the preexisting value.
>
> If you implement it like blame.ignoreRevsFile you can have your cake and
> eat it too, i.e.:
>
>     -c core.fsync= core.fsync=loose-object
>
> ensures only loose objects are synced, as with your single-value config,
> but I'd think what you'd be more likely to actually mean would be:
>
>     # With "core.fsync=pack" set in ~/.gitconfig
>     -c core.fsync=loose-object
>
> I.e. that the common case is "I want this to be synced here", not that
> you'd like to declare sync policy from scratch every time.
>
> In any case, on this general topic my main point is that the
> git-config(1) command has pretty integration for multi-value if you do
> it that way, and not for comma-delimited. I.e. you get --add, --unset,
> --unset-all, --get, --get-all etc.
>
> So I think for anything new it makes sense to lean into that, I think
> most of these existing comma-delimited ones are ones we'd do differently
> today on reflection.
>
> And if you suppor the "empty resets" like blame.ignoreRevsFile it seems
> to me you'll have your cake & eat it too.
>

I really don't like the multiple `-c core.fsync=` clauses.  If I want
to do packs and loose-objects as a single config, I'd have to do:
        * multi-value: `git -c core.fsync= -c core.fsync=pack -c
core.fsync=loose-object`
        * comma-sep `git -c core.fsync=pack,loose-object`

Multi-valued configs are stateful and more verbose to configure.
Last-one-wins with comma-separated values has an advantage for
achieving a desired final configuration without regard for the
previous configuration, which is the way I expect the feature to be
used.

> >> E.g.:
> >>
> >>     $ git -c core.fsync=off config --type=bool core.fsync
> >>     false
> >>     $ git -c core.fsync=blah config --type=bool core.fsync
> >>     fatal: bad boolean config value 'blah' for 'core.fsync'
> >>
> >> Here we can get 'git config' to normalize what you call 'none', and you
> >> can tell via exit codes/normalization if it's "false". But if you invent
> >> a new term for "false" you can't do that as easily.
> >>
> >> We have various historical keys that take odd exceptions to that,
> >> e.g. "never", but unless we have a good reason to let's not invent more
> >> exceptions.
> >>
> >> >> >> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> >> >> >> > ---
> >> >> >> >  Documentation/config/core.txt | 27 +++++++++----
> >> >> >> >  builtin/fast-import.c         |  2 +-
> >> >> >> >  builtin/index-pack.c          |  4 +-
> >> >> >> >  builtin/pack-objects.c        |  8 ++--
> >> >> >> >  bulk-checkin.c                |  5 ++-
> >> >> >> >  cache.h                       | 39 +++++++++++++++++-
> >> >> >> >  commit-graph.c                |  3 +-
> >> >> >> >  config.c                      | 76 ++++++++++++++++++++++++++++++++++-
> >> >> >> >  csum-file.c                   |  5 ++-
> >> >> >> >  csum-file.h                   |  3 +-
> >> >> >> >  environment.c                 |  1 +
> >> >> >> >  midx.c                        |  3 +-
> >> >> >> >  object-file.c                 |  3 +-
> >> >> >> >  pack-bitmap-write.c           |  3 +-
> >> >> >> >  pack-write.c                  | 13 +++---
> >> >> >> >  read-cache.c                  |  2 +-
> >> >> >> >  16 files changed, 164 insertions(+), 33 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> >> >> >> > index dbb134f7136..4f1747ec871 100644
> >> >> >> > --- a/Documentation/config/core.txt
> >> >> >> > +++ b/Documentation/config/core.txt
> >> >> >> > @@ -547,6 +547,25 @@ core.whitespace::
> >> >> >> >    is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
> >> >> >> >    errors. The default tab width is 8. Allowed values are 1 to 63.
> >> >> >> >
> >> >> >> > +core.fsync::
> >> >> >> > +     A comma-separated list of parts of the repository which should be
> >> >> >> > +     hardened via the core.fsyncMethod when created or modified. You can
> >> >> >> > +     disable hardening of any component by prefixing it with a '-'. Later
> >> >> >> > +     items take precedence over earlier ones in the list. For example,
> >> >> >> > +     `core.fsync=all,-pack-metadata` means "harden everything except pack
> >> >> >> > +     metadata." Items that are not hardened may be lost in the event of an
> >> >> >> > +     unclean system shutdown.
> >> >> >> > ++
> >> >> >> > +* `none` disables fsync completely. This must be specified alone.
> >> >> >> > +* `loose-object` hardens objects added to the repo in loose-object form.
> >> >> >> > +* `pack` hardens objects added to the repo in packfile form.
> >> >> >> > +* `pack-metadata` hardens packfile bitmaps and indexes.
> >> >> >> > +* `commit-graph` hardens the commit graph file.
> >> >> >> > +* `objects` is an aggregate option that includes `loose-objects`, `pack`,
> >> >> >> > +  `pack-metadata`, and `commit-graph`.
> >> >> >> > +* `default` is an aggregate option that is equivalent to `objects,-loose-object`
> >> >> >> > +* `all` is an aggregate option that syncs all individual components above.
> >> >> >> > +
> >> >> >>
> >> >> >> It's probably a *bit* more work to set up, but I wonder if this wouldn't
> >> >> >> be simpler if we just said (and this is partially going against what I
> >> >> >> noted above):
> >> >> >>
> >> >> >> == BEGIN DOC
> >> >> >>
> >> >> >> core.fsync is a multi-value config variable where each item is a
> >> >> >> pathspec that'll get matched the same way as 'git-ls-files' et al.
> >> >> >>
> >> >> >> When we sync pretend that a path like .git/objects/de/adbeef... is
> >> >> >> relative to the top-level of the git
> >> >> >> directory. E.g. "objects/de/adbeaf.." or "objects/pack/...".
> >> >> >>
> >> >> >> You can then supply a list of wildcards and exclusions to configure
> >> >> >> syncing.  or "false", "off" etc. to turn it off. These are synonymous
> >> >> >> with:
> >> >> >>
> >> >> >>     ; same as "false"
> >> >> >>     core.fsync = ":!*"
> >> >> >>
> >> >> >> Or:
> >> >> >>
> >> >> >>     ; same as "true"
> >> >> >>     core.fsync = "*"
> >> >> >>
> >> >> >> Or, to selectively sync some things and not others:
> >> >> >>
> >> >> >>     ;; Sync objects, but not "info"
> >> >> >>     core.fsync = ":!objects/info/**"
> >> >> >>     core.fsync = "objects/**"
> >> >> >>
> >> >> >> See gitrepository-layout(5) for details about what sort of paths you
> >> >> >> might be expected to match. Not all paths listed there will go through
> >> >> >> this mechanism (e.g. currently objects do, but nothing to do with config
> >> >> >> does).
> >> >> >>
> >> >> >> We can and will match this against "fake paths", e.g. when writing out
> >> >> >> packs we may match against just the string "objects/pack", we're not
> >> >> >> going to re-check if every packfile we're writing matches your globs,
> >> >> >> ditto for loose objects. Be reasonable!
> >> >> >>
> >> >> >> This metharism is intended as a shorthand that provides some flexibility
> >> >> >> when fsyncing, while not forcing git to come up with labels for all
> >> >> >> paths the git dir, or to support crazyness like "objects/de/adbeef*"
> >> >> >>
> >> >> >> More paths may be added or removed in the future, and we make no
> >> >> >> promises that we won't move things around, so if in doubt use
> >> >> >> e.g. "true" or a wide pattern match like "objects/**". When in doubt
> >> >> >> stick to the golden path of examples provided in this documentation.
> >> >> >>
> >> >> >> == END DOC
> >> >> >>
> >> >> >>
> >> >> >> It's a tad more complex to set up, but I wonder if that isn't worth
> >> >> >> it. It nicely gets around any current and future issues of deciding what
> >> >> >> labels such as "loose-object" etc. to pick, as well as slotting into an
> >> >> >> existing method of doing exclude/include lists.
> >> >> >>
> >> >> >
> >> >> > I think this proposal is a lot of complexity to avoid coming up with a
> >> >> > new name for syncable things as they are added to Git.  A path based
> >> >> > mechanism makes it hard to document for the (advanced) user what the
> >> >> > full set of things is and how it might change from release to release.
> >> >> > I think the current core.fsync scheme is a bit easier to understand,
> >> >> > query, and extend.
> >> >>
> >> >> We document it in gitrepository-layout(5). Yeah it has some
> >> >> disadvantages, but one advantage is that you could make the
> >> >> composability easy. I.e. if last exclude wins then a setting of:
> >> >>
> >> >>     core.fsync = ":!*"
> >> >>     core.fsync = "objects/**"
> >> >>
> >> >> Would reset all previous matches & only match objects/**.
> >> >>
> >> >
> >> > The value of changing this is predicated on taking your previous
> >> > multi-valued config proposal, which I'm still not at all convinced
> >> > about.
> >>
> >> They're orthagonal. I.e. you get benefits from multi-value with or
> >> without this globbing mechanism.
> >>
> >> In any case, I don't feel strongly about/am really advocating this
> >> globbing mechanism. I just wondered if it wouldn't make things simpler
> >> since it would sidestep the need to create any sort of categories for
> >> subsets of gitrepository-layout(5), but maybe not...
> >>
> >> > The schema in the current (v1-v2) version of the patch already
> >> > includes an example of extending the list of syncable things, and
> >> > Patrick Steinhardt made it clear that he feels comfortable adding
> >> > 'refs' to the same schema in a future change.
> >> >
> >> > I'll also emphasize that we're talking about a non-functional,
> >> > relatively corner-case behavioral configuration.  These values don't
> >> > change how git's interface behaves except when the system crashes
> >> > during a git command or shortly after one completes.
> >>
> >> That may be something some OS's promise, but it's not something fsync()
> >> or POSIX promises. I.e. you might write a ref, but unless you fsync and
> >> the relevant dir entries another process might not see the update, crash
> >> or not.
> >>
> >
> > I haven't seen any indication that POSIX requires an fsync for
> > visiblity within a running system.  I looked at the doc for open,
> > write, and fsync, and saw no indication that it's posix compliant to
> > require an fsync for visibility.  I think an OS that required fsync
> > for cross-process visiblity would fail to run Git for a myriad of
> > other reasons and would likely lose all its users.  I'm curious where
> > you've seen documentation that allows such unhelpful behavior?
>
> There's multiple unrelated and related things in this area. One is a
> case where you'll e.g. write a file "foo" using stdio, spawn a program
> to work on it in the same program, but it might not see it at all, or
> see empty content, the latter being because you haven't flushed your I/O
> buffers (which you can do via fsync()).
>

For stdio you need to use fflush(3), which just flushes the C
runtime's internal buffers.  You need to do the following to do a full
durable write using stdio:
```
FILE *fp;
...
fflush(fp);
fsync(fileno(fp))
```

> The former is that on *nix systems you're generally only guaranteed to
> write to a fd, but not to have the associated metadata be synced for
> you.
>
> That is spelled out e.g. in the DESCRIPTION section of linux's fsync()
> manpage: https://man7.org/linux/man-pages/man2/fdatasync.2.html
>
> I don't know how much you follow non-Windows FS development, but there
> was also a very well known "incident" early in ext4 where it leaned into
> some permissive-by-POSIX behavior that caused data loss in practice on
> buggy programs that didn't correctly use fsync() , since various tooling
> had come to expect the stricter behavior of ext3:
> https://lwn.net/Articles/328363/
>
> That was explicitly in the area of fs metadata being discussed here.
>
> Generally you can expect your VFS layer to be forgiving when it comes to
> IPC, but even that is out the window when it comes to networked
> filesystems, e.g. a shared git repository hosted on NFS.
>

Everything in the fsync(2) DESCRIPTION section is about what data and
metadata reaches the disk (versus just being cached in-memory).  I've
become a bit familiar with the ext3 vs ext4 (and delayed alloc)
behavior while researching this feature.  These behaviors are all
around the durability you get in the case of kernel crash,
power-failure, or other forms of dirty dismount.

NFS is a complex story.  I'm not intimately familiar with its
particular pitfalls, but from looking at the Linux NFS faq
(http://nfs.sourceforge.net/), it appears that a given single NFS
client will remain coherent with itself. Multiple NFS clients
accessing a single Git repo concurrently are probably going to see
some inconsistency.  In that kind of case, fsync would help, perhaps,
since it would force NFS clients to issue a COMMIT command to the
server.

> >> That's an aside from these config design questions, and I think
> >> most/(all?) OS's anyone cares about these days tend to make that
> >> implicit promise as part of their VFS behavior, but we should probably
> >> design such an interface to fsync() with such pedantic portability in
> >> mind.
> >
> > Why? To be rude to such a hypothetical system, if a system were so
> > insanely designed, it would be nuts to support it.
>
> Because we know that right now the system calls we're invoking aren't
> guaranteed to store data persistently to disk portably, although they do
> so in practice on most modern OS's.
>
> We're portably to a lot of platforms, and also need to keep e.g. NFS in
> mind, so being able to ask for a pedantic mode when you care about data
> retention at the cost of performance would be nice.
>
> And because the fsync config mode you're proposing is thoroughly
> non-standard, but is known to me much faster by leaning into known
> attributes of specific FS's on specific OS's, if we're not running on
> those it would be sensible to fall back to a stricter mode of
> operation. E.g. syncing all 100 loose objects we just wrote, not just
> the last one.
>
> >> > While you may not personally love the proposed configuration
> >> > interface, I'd want your view on some questions:
> >> > 1. Is it easy for the (advanced) user to set a configuration?
> >> > 2. Is it easy for the (advanced) user to see what was configured?
> >> > 3. Is it easy for the Git community to build on this as we want to add
> >> > things to the list of things to sync?
> >> >     a) Is there a good best practice configuration so that people can
> >> > avoid losing integrity for new stuff that they are intending to sync.
> >> >     b) If someone has a custom configuration, can that custom
> >> > configuration do something reasonable as they upgrade versions of Git?
> >> >              ** In response to this question, I might see some value
> >> > in adding a 'derived-metadata' aggregate that can be disabled so that
> >> > a custom configuration can exclude those as they change version to
> >> > version.
> >> >     c) Is it too much maintenance overhead to consider how to present
> >> > this configuration knob for any new hashfile or other datafile in the
> >> > git repo?
> >> > 4. Is there a good path forward to change the default syncable set,
> >> > both in git-for-windows and in Git for other platforms?
> >>
> >> I'm not really sure this globbing this is a good idea, as noted above
> >> just a suggestion etc.
> >>
> >> As noted there it just gets you out of the business of re-defining
> >> gitrepository-layout(5), and assuming too much in advance about certain
> >> use-cases.
> >>
> >> E.g. even "refs" might be too broad for some. I don't tend to be I/O
> >> limited, but I could see how someone who would be would care about
> >> refs/heads but not refs/remotes, or want to exclude logs/* but not the
> >> refs updates themselves etc.
> >
> > This use-case is interesting (distinguishing remote refs from local
> > refs).  I think the difficulty of verifying (for even an advanced
> > user) that the right fsyncing is actually happening still puts me on
> > the side of having a carefully curated and documented set of syncable
> > things rather than a file-path-based mechanism.
> >
> > Is this meaningful in the presumably nearby future world of the refsdb
> > backend?  Is that somehow split by remote versus local?
>
> There is the upcoming "reftable" work, but that's probably 2-3 years out
> at the earliest for series production workloads in git.git.
>
> >> >> >> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> >> >> >> > index 857be7826f3..916c55d6ce9 100644
> >> >> >> > --- a/builtin/pack-objects.c
> >> >> >> > +++ b/builtin/pack-objects.c
> >> >> >> > @@ -1204,11 +1204,13 @@ static void write_pack_file(void)
> >> >> >> >                * If so, rewrite it like in fast-import
> >> >> >> >                */
> >> >> >> >               if (pack_to_stdout) {
> >> >> >> > -                     finalize_hashfile(f, hash, CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> >> >> >> > +                     finalize_hashfile(f, hash, FSYNC_COMPONENT_NONE,
> >> >> >> > +                                       CSUM_HASH_IN_STREAM | CSUM_CLOSE);
> >> >> >>
> >> >> >> Not really related to this per-se, but since you're touching the API
> >> >> >> everything goes through I wonder if callers should just always try to
> >> >> >> fsync, and we can just catch EROFS and EINVAL in the wrapper if someone
> >> >> >> tries to flush stdout, or catch the fd at that lower level.
> >> >> >>
> >> >> >> Or maybe there's a good reason for this...
> >> >> >
> >> >> > It's platform dependent, but I'd expect fsync would do something for
> >> >> > pipes or stdout redirected to a file.  In these cases we really don't
> >> >> > want to fsync since we have no idea what we're talking to and we're
> >> >> > potentially worsening performance for probably no benefit.
> >> >>
> >> >> Yeah maybe we should just leave it be.
> >> >>
> >> >> I'd think the C library returning EINVAL would be a trivial performance
> >> >> cost though.
> >> >>
> >> >> It just seemed odd to hardcode assumptions about what can and can't be
> >> >> synced when the POSIX defined function will also tell us that.
> >> >>
> >> >
> >> > Redirecting stdout to a file seems like a common usage for this
> >> > command. That would definitely be fsyncable, but Git has no idea what
> >> > its proper category is since there's no way to know the purpose or
> >> > lifetime of the packfile.  I'm going to leave this be, because I'd
> >> > posit that "can it be fsynced?" is not the same as "should it be
> >> > fsynced?".  The latter question can't be answered for stdout.
> >>
> >> As noted this was just an aside, and I don't even know if any OS would
> >> do anything meaningful with an fsync() of such a FD anyway.
> >>
> >
> > The underlying fsync primitive does have a meaning on Windows for
> > pipes, but it's certainly not what Git would want to do. Also if
> > stdout is redirected to a file, I'm pretty sure that UNIX OSes would
> > respect the fsync call.  However it's not meaningful in the sense of
> > the git repository, since we don't know what the packfile is or why it
> > was created.
>
> I suggested that because I think it's probably nonsensical, but it's
> nonsense that POSIX seems to explicitly tell us that it'll handle
> (probably by silently doing nothing). So in terms of our interface we
> could lean into that and avoid our own special-casing.
>
> >> I just don't see why we wouldn't say:
> >>
> >>  1. We're syncing this category of thing
> >>  2. Try it
> >>  3. If fsync returns "can't fsync that sort of thing" move on
> >>
> >> As opposed to trying to shortcut #3 by doing the detection ourselves.
> >>
> >> I.e. maybe there was a good reason, but it seemed to be some easy
> >> potential win for more simplification, since you were re-doing and
> >> simplifying some of the interface anyway...
> >
> > We're trying to be deliberate about what we're fsyncing.  Fsyncing an
> > unknown file created by the packfile code doesn't move us in that
> > direction.  In your taxonomy we don't know (1), "what is this category
> > of thing?"  Sure it's got the packfile format, but is not known to be
> > an actual packfile that's part of the repository.
>
> We know it's a fd, isn't that sufficient? In any case, I'm fine with
> also keeping it as is, I don't mean to split hairs here.
>
> It just stuck out as an odd part of the interface, why treat some fd's
> specially, instead of just throwing it all at the OS. Presumably the
> first thing the OS will do is figure out if it's a syncable fd or not,
> and act appropriately.
>

I'll put the following comment in pack-objects.c:
/*
* We never fsync when writing to stdout since we may
* not be writing to a specific file. For instance, the
* upload-pack code passes a pipe here. Calling fsync
* on a pipe results in unnecessary synchronization with
* the reader on some platforms.
*/

> >> >>
> >> >> >> > [...]
> >> >> >> > +/*
> >> >> >> > + * These values are used to help identify parts of a repository to fsync.
> >> >> >> > + * FSYNC_COMPONENT_NONE identifies data that will not be a persistent part of the
> >> >> >> > + * repository and so shouldn't be fsynced.
> >> >> >> > + */
> >> >> >> > +enum fsync_component {
> >> >> >> > +     FSYNC_COMPONENT_NONE                    = 0,
> >> >> >>
> >> >> >> I haven't read ahead much but in most other such cases we don't define
> >> >> >> the "= 0", just start at 1<<0, then check the flags elsewhere...
> >> >> >>
> >> >> >> > +static const struct fsync_component_entry {
> >> >> >> > +     const char *name;
> >> >> >> > +     enum fsync_component component_bits;
> >> >> >> > +} fsync_component_table[] = {
> >> >> >> > +     { "loose-object", FSYNC_COMPONENT_LOOSE_OBJECT },
> >> >> >> > +     { "pack", FSYNC_COMPONENT_PACK },
> >> >> >> > +     { "pack-metadata", FSYNC_COMPONENT_PACK_METADATA },
> >> >> >> > +     { "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
> >> >> >> > +     { "objects", FSYNC_COMPONENTS_OBJECTS },
> >> >> >> > +     { "default", FSYNC_COMPONENTS_DEFAULT },
> >> >> >> > +     { "all", FSYNC_COMPONENTS_ALL },
> >> >> >> > +};
> >> >> >> > +
> >> >> >> > +static enum fsync_component parse_fsync_components(const char *var, const char *string)
> >> >> >> > +{
> >> >> >> > +     enum fsync_component output = 0;
> >> >> >> > +
> >> >> >> > +     if (!strcmp(string, "none"))
> >> >> >> > +             return output;
> >> >> >> > +
> >> >> >> > +     while (string) {
> >> >> >> > +             int i;
> >> >> >> > +             size_t len;
> >> >> >> > +             const char *ep;
> >> >> >> > +             int negated = 0;
> >> >> >> > +             int found = 0;
> >> >> >> > +
> >> >> >> > +             string = string + strspn(string, ", \t\n\r");
> >> >> >>
> >> >> >> Aside from the "use a list" isn't this hardcoding some windows-specific
> >> >> >> assumptions with \n\r? Maybe not...
> >> >> >
> >> >> > I shamelessly stole this code from parse_whitespace_rule. I thought
> >> >> > about making a helper to be called by both functions, but the amount
> >> >> > of state going into and out of the wrapper via arguments was
> >> >> > substantial and seemed to negate the benefit of deduplication.
> >> >>
> >> >> FWIW string_list_split() is easier to work with in those cases, or at
> >> >> least I think so...
> >> >
> >> > This code runs at startup for a variable that may be present on some
> >> > installations.  The nice property of the current patch's code is that
> >> > it's already a well-tested pattern that doesn't do any allocations as
> >> > it's working, unlike string_list_split().
> >>
> >> Multi-value config would also get you fewer allocations :)
> >>
> >> Anyway, I mainly meant to point out that for stuff like this it's fine
> >> to optimize it for ease rather than micro-optimize allocations. Those
> >> really aren't a bottleneck on this scale.
> >>
> >> Even in that case there's string_list_split_in_place(), which can be a
> >> bit nicer than manual C-string fiddling.
> >>
> >
> > Am I allowed to change the config value string in place? The
> > core.whitespace code is careful not to modify the string. I kind of
> > like the parse_ws_error_highlight code a little better now that I've
> > seen it, but I think the current code is fine too.
>
> I don't remember offhand if that's safe, probably not. So you'll need a
> copy here.
>
> >> > I hope you know that I appreciate your review feedback, even though
> >> > I'm pushing back on most of it so far this round. I'll be sending v3
> >> > to the list soon after giving it another look over.
> >>
> >> Sure, no worries. Just hoping to help. If you go for something different
> >> etc. that's fine. Just hoping to bridge the gap in some knowledge /
> >> offer potentially interesting suggestions (some of which may be dead
> >> ends, like the config glob thing...).

Thanks again for the review, I'll send an updated PR soon.

Thanks,
Neeraj

@neerajsi-msft neerajsi-msft force-pushed the ns/core-fsync branch 2 times, most recently from afaaf12 to 2207950 Compare December 8, 2021 00:15
@gitgitgadget
Copy link

gitgitgadget bot commented Dec 8, 2021

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Tue, Dec 7, 2021 at 3:57 AM Patrick Steinhardt <ps@pks.im> wrote:

> While I bail from the question of whether we want to grant as much
> configurability to the user as this patch series does, I quite like the
> implementation. It feels rather straight-forward and it's easy to see
> how to extend it to support syncing of other subsystems like the loose
> refs.
>
> Thanks!
>
> Patrick

Thanks for the positive comment.  I'm assuming that a major Git
services like GitHub or GitLab would be able to take advantage of the
granular options and knowledge of their hosting environment to choose
the right values for any server-side git deployments.  I'd probably
turn off syncing for derived stuff like the commit-graph file and pack
metadata.

My underlying interest in all of these changes is to make Windows stop
looking so bad (we're defaulting to core.fsyncobjectfiles=true).
Batch mode should give similar safety and much more optimizable
performance in our environment.

Thanks,
Neeraj

@neerajsi-msft
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Dec 9, 2021

Submitted as pull.1093.v3.git.1639011433.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1093/neerajsi-msft/ns/core-fsync-v3

To fetch this version to local tag pr-1093/neerajsi-msft/ns/core-fsync-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1093/neerajsi-msft/ns/core-fsync-v3

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 11, 2022

On the Git mailing list, Neeraj Singh wrote (reply to this):

Hello Everyone,
I wanted to revive this thread in the new year.

To summarize the current state of affairs:
* The current fsync patch series implements two new configuration options:
       core.fsync = <comma-separate list> -- select which repo
components will be fsynced
       core.fsyncMethod = fsync|writeout-only  -- select what form of
fsyncing will be done

* This patch series now ignores core.fsyncObjectFiles with a
deprecation warning pointing the user at core.fsync.

* There is a follow-on series that will extend the core.fsyncMethod to
also include a `batch` mode that speeds up bulk operations by avoiding
repeated disk cache flushes.

* I developed the current mechanism after Ævar pointed out that the
original `core.fsyncObjectFiles=batch` change would cause older
versions of Git to die() when exposed to a new configuration. There
were also several fsync changes floating around, including Patrick
Steinhardts `core.fsyncRefFiles` change [1] and Eric Wong's
`core.fsync = false` change [2].

* The biggest sticking points are in [3].  The fundamental
disagreement is about whether core.fsync should look like:
      A) core.fsync = objects,commit-graph   [current patch implementation]
      or
      B) core.fsync = objects
          core.fsync = commit-graph    [Ævar's multivalued proposal].
I prefer sticking with (A) for reasons spelled out in the thread. I'm
happy to re-litigate this discussion though.

* There's also a sticking point about whether we should fsync when
invoking pack-objects against stdout.  I think that mostly reflects a
missing comment in the code rather than a real disagreement.

* Now that ew/test-wo-fsync has been integrated, there's some
redundancy between core.fsync=none and Eric's patch.

Open questions:
1) What format should we use for the core.fsync configuration to
select individual repo components to sync?
2) Are we okay with deprecating core.fsyncObjectFiles in a single
release with a warning?
3) Is it reasonable to expect people adding new persistent files to
add and document new values of the core.fsync settings?

Thanks,
Neeraj

[1]  https://lore.kernel.org/git/20211030103950.M489266@dcvr/
[2] https://lore.kernel.org/git/20211028002102.19384-1-e@80x24.org/
[3] https://lore.kernel.org/git/211207.86wnkgo9fv.gmgdl@evledraar.gmail.com/

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 11, 2022

On the Git mailing list, wrote (reply to this):

On January 7, 2022 8:14 PM, Neeraj Singh wrote:
> Hello Everyone,
> I wanted to revive this thread in the new year.
> 
> To summarize the current state of affairs:
> * The current fsync patch series implements two new configuration options:
>        core.fsync = <comma-separate list> -- select which repo components will
> be fsynced
>        core.fsyncMethod = fsync|writeout-only  -- select what form of fsyncing
> will be done
> 
> * This patch series now ignores core.fsyncObjectFiles with a deprecation
> warning pointing the user at core.fsync.
> 
> * There is a follow-on series that will extend the core.fsyncMethod to also
> include a `batch` mode that speeds up bulk operations by avoiding repeated
> disk cache flushes.
> 
> * I developed the current mechanism after Ævar pointed out that the original
> `core.fsyncObjectFiles=batch` change would cause older versions of Git to
> die() when exposed to a new configuration. There were also several fsync
> changes floating around, including Patrick Steinhardts `core.fsyncRefFiles`
> change [1] and Eric Wong's `core.fsync = false` change [2].
> 
> * The biggest sticking points are in [3].  The fundamental disagreement is
> about whether core.fsync should look like:
>       A) core.fsync = objects,commit-graph   [current patch implementation]
>       or
>       B) core.fsync = objects
>           core.fsync = commit-graph    [Ævar's multivalued proposal].
> I prefer sticking with (A) for reasons spelled out in the thread. I'm happy to re-
> litigate this discussion though.
> 
> * There's also a sticking point about whether we should fsync when invoking
> pack-objects against stdout.  I think that mostly reflects a missing comment in
> the code rather than a real disagreement.
> 
> * Now that ew/test-wo-fsync has been integrated, there's some redundancy
> between core.fsync=none and Eric's patch.
> 
> Open questions:
> 1) What format should we use for the core.fsync configuration to select
> individual repo components to sync?
> 2) Are we okay with deprecating core.fsyncObjectFiles in a single release with
> a warning?
> 3) Is it reasonable to expect people adding new persistent files to add and
> document new values of the core.fsync settings?
> 
> Thanks,
> Neeraj
> 
> [1]  https://lore.kernel.org/git/20211030103950.M489266@dcvr/
> [2] https://lore.kernel.org/git/20211028002102.19384-1-e@80x24.org/
> [3]
> https://lore.kernel.org/git/211207.86wnkgo9fv.gmgdl@evledraar.gmail.com/

Neeraj,

Please remember that fsync() is operating system and version specific. You cannot make any assumptions about what is supported and what is not. I have recently had issues with git built on a recent operating system not running on a version from 2020. The proposed patches do not work, as I recall, in a portable manner, so caution is required making this change. You can expect this not to work on some platforms and some versions. Please account for that. Requiring users who are not aware of OS details to configure git to function at all is a bad move, in my view - which has not changed since last time.

Thanks,
Randall

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 11, 2022

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Sat, Jan 8, 2022 at 4:55 PM <rsbecker@nexbridge.com> wrote:
>
> Please remember that fsync() is operating system and version specific. You cannot make any assumptions about what is supported and what is not. I have recently had issues with git built on a recent operating system not running on a version from 2020. The proposed patches do not work, as I recall, in a portable manner, so caution is required making this change. You can expect this not to work on some platforms and some versions. Please account for that. Requiring users who are not aware of OS details to configure git to function at all is a bad move, in my view - which has not changed since last time.
>

There was already an implied configuration of fsync in the Git
codebase.  None of the defaults are changing--assuming that a user
does not explicitly configure the core.fsync setting, Git should work
the same as it always has.  I don't believe the current patch series
introduces any new incompatibilities.

Thanks,
Neeraj

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 13, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> Hmph, this seems to make one test fail.
>
> t5801-remote-helpers.sh (Wstat: 256 Tests: 31 Failed: 4)
>   Failed tests:  14-16, 31
>     Non-zero exit status: 1
> Files=1, Tests=31,  2 wallclock secs ( 0.04 usr  0.00 sys + 1.40 cusr  1.62 csys =  3.06 CPU)
> Result: FAIL

False alarm.  This byitself, or merged to 'seen' with other random
topics, no longer seem to break these tests.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 14, 2022

This patch series was integrated into seen via git@7637c4f.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 14, 2022

There was a status update in the "New Topics" section about the branch ns/core-fsyncmethod on the Git mailing list:

Replace core.fsyncObjectFiles with two new configuration variables,
core.fsync and core.fsyncMethod.

Will merge to 'next'?
source: <pull.1093.v6.git.1646952204.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 15, 2022

This patch series was integrated into seen via git@f0cde07.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 16, 2022

This patch series was integrated into seen via git@87f3a13.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 16, 2022

This patch series was integrated into seen via git@fd008b1.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 16, 2022

This patch series was integrated into seen via git@ce52388.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2022

This patch series was integrated into seen via git@28378d6.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2022

There was a status update in the "Cooking" section about the branch ns/core-fsyncmethod on the Git mailing list:

Replace core.fsyncObjectFiles with two new configuration variables,
core.fsync and core.fsyncMethod.

Will merge to 'next'.
source: <pull.1093.v6.git.1646952204.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2022

This patch series was integrated into seen via git@7452d44.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2022

This patch series was integrated into seen via git@a976459.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2022

This patch series was integrated into next via git@c8a52f8.

@gitgitgadget gitgitgadget bot added the next label Mar 17, 2022
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 18, 2022

This patch series was integrated into seen via git@3901431.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2022

This patch series was integrated into seen via git@6259956.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2022

This patch series was integrated into seen via git@c9b2e38.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 23, 2022

This patch series was integrated into seen via git@79e7880.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 24, 2022

There was a status update in the "Cooking" section about the branch ns/core-fsyncmethod on the Git mailing list:

Replace core.fsyncObjectFiles with two new configuration variables,
core.fsync and core.fsyncMethod.

Will merge to 'master'.
source: <pull.1093.v6.git.1646952204.gitgitgadget@gmail.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 24, 2022

This patch series was integrated into seen via git@4f5f388.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 25, 2022

This patch series was integrated into seen via git@615ae0c.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 25, 2022

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Fri, Mar 11, 2022 at 10:58:59AM +0100, Patrick Steinhardt wrote:
> When writing both loose and packed references to disk we first create a
> lockfile, write the updated values into that lockfile, and on commit we
> rename the file into place. According to filesystem developers, this
> behaviour is broken because applications should always sync data to disk
> before doing the final rename to ensure data consistency [1][2][3]. If
> applications fail to do this correctly, a hard crash of the machine can
> easily result in corrupted on-disk data.
> 
> This kind of corruption can in fact be easily observed with Git when the
> machine hard-resets shortly after writing references to disk. On
> machines with ext4, this will likely lead to the "empty files" problem:
> the file has been renamed, but its data has not been synced to disk. The
> result is that the reference is corrupt, and in the worst case this can
> lead to data loss.
> 
> Implement a new option to harden references so that users and admins can
> avoid this scenario by syncing locked loose and packed references to
> disk before we rename them into place.

In 't5541-http-push-smart.sh' there is a test case called 'push 2000
tags over http', which does pretty much what it's title says.  This
patch makes that test case significantly slower.


diff --git a/t/t5541-http-push-smart.sh b/t/t5541-http-push-smart.sh
index 8ca50f8b18..d7e94cb791 100755
--- a/t/t5541-http-push-smart.sh
+++ b/t/t5541-http-push-smart.sh
@@ -415,7 +415,7 @@ test_expect_success CMDLINE_LIMIT 'push 2000 tags over http' '
 	  sort |
 	  sed "s|.*|$sha1 refs/tags/really-long-tag-name-&|" \
 	  >.git/packed-refs &&
-	run_with_limited_cmdline git push --mirror
+	run_with_limited_cmdline /usr/bin/time git push --mirror
 '
 
 test_expect_success GPG 'push with post-receive to inspect certificate' '

Before this patch (bc22d845c4^) 'time' reports:

  3.62user 0.03system 0:03.83elapsed 95%CPU (0avgtext+0avgdata 11904maxresident)k
  0inputs+312outputs (0major+4597minor)pagefaults 0swaps

With this patch (bc22d845c4):

  3.56user 0.04system 0:33.60elapsed 10%CPU (0avgtext+0avgdata 11832maxresident)k
  0inputs+320outputs (0major+4578minor)pagefaults 0swaps

And the total runtime of the whole test script increases from 8-9s to
37-39s.


I wonder whether we should relax the fsync options for this test case.

> 
> [1]: https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
> [2]: https://btrfs.wiki.kernel.org/index.php/FAQ (What are the crash guarantees of overwrite-by-rename)
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/ext4.rst (see auto_da_alloc)
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> 
> Hi,
> 
> here's my updated patch series which implements syncing of refs. It
> applies on top of Neeraj's v6 of "A design for future-proofing fsync()
> configuration".
> 
> I've simplified these patches a bit:
> 
>     - I don't distinguishing between "loose" and "packed" refs anymore.
>       I agree with Junio that it's probably not worth it, but we can
>       still reintroduce the split at a later point without breaking
>       backwards compatibility if the need comes up.
> 
>     - I've simplified the way loose refs are written to disk so that we
>       now sync them when before we close their files. The previous
>       implementation I had was broken because we tried to sync after
>       closing.
> 
> Because this really only changes a few lines of code I've also decided
> to squash together the patches into a single one.
> 
> Patrick
> 
>  Documentation/config/core.txt | 1 +
>  cache.h                       | 7 +++++--
>  config.c                      | 1 +
>  refs/files-backend.c          | 1 +
>  refs/packed-backend.c         | 3 ++-
>  5 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index 37105a7be4..812cca7de7 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -575,6 +575,7 @@ but risks losing recent work in the event of an unclean system shutdown.
>  * `index` hardens the index when it is modified.
>  * `objects` is an aggregate option that is equivalent to
>    `loose-object,pack`.
> +* `reference` hardens references modified in the repo.
>  * `derived-metadata` is an aggregate option that is equivalent to
>    `pack-metadata,commit-graph`.
>  * `committed` is an aggregate option that is currently equivalent to
> diff --git a/cache.h b/cache.h
> index cde0900d05..033e5b0779 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1005,6 +1005,7 @@ enum fsync_component {
>  	FSYNC_COMPONENT_PACK_METADATA		= 1 << 2,
>  	FSYNC_COMPONENT_COMMIT_GRAPH		= 1 << 3,
>  	FSYNC_COMPONENT_INDEX			= 1 << 4,
> +	FSYNC_COMPONENT_REFERENCE		= 1 << 5,
>  };
>  
>  #define FSYNC_COMPONENTS_OBJECTS (FSYNC_COMPONENT_LOOSE_OBJECT | \
> @@ -1017,7 +1018,8 @@ enum fsync_component {
>  				  FSYNC_COMPONENTS_DERIVED_METADATA | \
>  				  ~FSYNC_COMPONENT_LOOSE_OBJECT)
>  
> -#define FSYNC_COMPONENTS_COMMITTED (FSYNC_COMPONENTS_OBJECTS)
> +#define FSYNC_COMPONENTS_COMMITTED (FSYNC_COMPONENTS_OBJECTS | \
> +				    FSYNC_COMPONENT_REFERENCE)
>  
>  #define FSYNC_COMPONENTS_ADDED (FSYNC_COMPONENTS_COMMITTED | \
>  				FSYNC_COMPONENT_INDEX)
> @@ -1026,7 +1028,8 @@ enum fsync_component {
>  			      FSYNC_COMPONENT_PACK | \
>  			      FSYNC_COMPONENT_PACK_METADATA | \
>  			      FSYNC_COMPONENT_COMMIT_GRAPH | \
> -			      FSYNC_COMPONENT_INDEX)
> +			      FSYNC_COMPONENT_INDEX | \
> +			      FSYNC_COMPONENT_REFERENCE)
>  
>  /*
>   * A bitmask indicating which components of the repo should be fsynced.
> diff --git a/config.c b/config.c
> index eb75f65338..3c9b6b589a 100644
> --- a/config.c
> +++ b/config.c
> @@ -1333,6 +1333,7 @@ static const struct fsync_component_name {
>  	{ "commit-graph", FSYNC_COMPONENT_COMMIT_GRAPH },
>  	{ "index", FSYNC_COMPONENT_INDEX },
>  	{ "objects", FSYNC_COMPONENTS_OBJECTS },
> +	{ "reference", FSYNC_COMPONENT_REFERENCE },
>  	{ "derived-metadata", FSYNC_COMPONENTS_DERIVED_METADATA },
>  	{ "committed", FSYNC_COMPONENTS_COMMITTED },
>  	{ "added", FSYNC_COMPONENTS_ADDED },
> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index f59589d6cc..6521ee8af5 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -1787,6 +1787,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    fsync_component(FSYNC_COMPONENT_REFERENCE, get_lock_file_fd(&lock->lk)) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index 27dd8c3922..9d704ccd3e 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -1262,7 +1262,8 @@ static int write_with_updates(struct packed_ref_store *refs,
>  		goto error;
>  	}
>  
> -	if (close_tempfile_gently(refs->tempfile)) {
> +	if (fsync_component(FSYNC_COMPONENT_REFERENCE, get_tempfile_fd(refs->tempfile)) ||
> +	    close_tempfile_gently(refs->tempfile)) {
>  		strbuf_addf(err, "error closing file %s: %s",
>  			    get_tempfile_path(refs->tempfile),
>  			    strerror(errno));
> -- 
> 2.35.1
> 


@gitgitgadget
Copy link

gitgitgadget bot commented Mar 25, 2022

User SZEDER Gábor <szeder.dev@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 26, 2022

This patch series was integrated into seen via git@eb804cd.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 26, 2022

This patch series was integrated into master via git@eb804cd.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 26, 2022

This patch series was integrated into next via git@eb804cd.

@gitgitgadget gitgitgadget bot added the master label Mar 26, 2022
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 26, 2022

Closed via eb804cd.

@gitgitgadget gitgitgadget bot closed this Mar 26, 2022
@@ -547,13 +547,23 @@ core.whitespace::
is relevant for `indent-with-non-tab` and when Git fixes `tab-in-indent`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jiang Xin wrote (reply to this):

On Sat, Mar 12, 2022 at 6:25 AM Neeraj Singh via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> @@ -1613,6 +1687,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>         }
>
>         if (!strcmp(var, "core.fsyncobjectfiles")) {
> +               if (fsync_object_files < 0)
> +                       warning(_("core.fsyncobjectfiles is deprecated; use core.fsync instead"));

s/core.fsyncobjectfiles/core.fsyncObjectFiles/  to use bumpyCaps for
config variable in documentation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Neeraj Singh wrote (reply to this):

On Mon, Mar 28, 2022 at 4:07 AM Jiang Xin <worldhello.net@gmail.com> wrote:
>
> On Sat, Mar 12, 2022 at 6:25 AM Neeraj Singh via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> > @@ -1613,6 +1687,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
> >         }
> >
> >         if (!strcmp(var, "core.fsyncobjectfiles")) {
> > +               if (fsync_object_files < 0)
> > +                       warning(_("core.fsyncobjectfiles is deprecated; use core.fsync instead"));
>
> s/core.fsyncobjectfiles/core.fsyncObjectFiles/  to use bumpyCaps for
> config variable in documentation.

THanks for pointing this out.  I'll fix it.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 28, 2022

User Jiang Xin <worldhello.net@gmail.com> has been added to the cc: list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant