Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Offer to run CI/PR builds in Azure Pipelines #31

Closed
wants to merge 21 commits into from

Conversation

dscho
Copy link
Member

@dscho dscho commented Aug 31, 2018

For a long time already, we have tested Git's source code continuously via Travis CI, see e.g. https://travis-ci.org/git/git/builds/421738884. It has served us well, and more and more developers actually pay attention and benefit from the testing this gives us.

It is also an invaluable tool for contributors who can validate their code contributions via PRs on GitHub, e.g. to verify that their tests do actually run on macOS (i.e. with the BSD family of Unix tools instead of the GNU one).

The one sad part about this is the Windows support. Travis used to lack it (Travis announced some early support for Windows, hot off the press), and we work around that by using Azure Pipelines (the CI part of Azure DevOps, formerly known as Visual Studio Team Services) indirectly: one phase in Travis would trigger a build, wait for its log, and then paste that log.

As Git's Windows builds (and tests!) take quite a bit of time, Travis often timed out, or somehow the trigger did not work, and for security reasons (the Windows builds are performed in a private pool of containers), the Windows builds are completely disabled for Pull Requests on GitHub.

One might ask why we did not use Azure Pipelines directly. There were a couple of reasons for that:

  • most notably, Azure Pipelines' build logs could not be viewed anonymously,
  • while Azure Pipelines had Linux and Windows agents, it lacked macOS agents,
  • etc

The main two reasons no longer apply: macOS agents are available now, as are public projects, i.e. it is possible to configure a Azure Pipelines project so that anybody can view the logs. Since I offered v1, Azure Pipelines has been made available via the GitHub Marketplace, free of cost for open source projects.

I had secured such a public project for Git for Windows already, and I also got one for Git. For now, the latter is hooked up with my personal git.git fork on GitHub, but it is my hope that I convince y'all that these Azure Pipelines builds are a good idea, and then hook it up with https://github.com/git/git.

As a special treat, this patch series adds the ability to present the outcome of Git's test suite as JUnit-style .xml files. This allows the Azure Pipelines build to present fun diagrams, trends, and makes it a lot easier to drill down to test failures than before. See for example https://dev.azure.com/git/git/_build/results?buildId=113&view=ms.vss-test-web.test-result-details (you can click on the label of the failed test, and then see the detailed output in the right pane).

But maybe you're not interested as much in the presentation of test failures as in the speed? Well, I got you covered with that, too. As of v3, the test suite is run in parallel on Windows, cutting the overall run time to ~33 minutes (see one of the latest builds, for example).

This patch series took waaaaaaaay more time than I had originally anticipated, but I think that in particular the advanced display of the test results and the reduction of the overall run time was worth it. Please let me know what you think about this.

Changes since v5:

  • Fixed sparse warnings.
  • Reworded the commit message of "t0061: fix for --with-dashes and RUN TIME_PREFIX" to clarify why there is a warning message in the first place, why this cannot be fixed, and that we introduce a simple workaround here.

Changes since v4:

  • Clarified in the commit message why BREW_INSTALL_PACKAGES is needed.
  • Added a clause to error out if we could not detect the CI type.
  • Introduced a CI_TYPE variable to guard CI type-specific code.
  • MAKEFLAGS is now actually configured in the CI type-specific section.
  • Moved the inadvertent fixup for GIT_TEST_OPTS from "Add a build definition for Azure DevOps" to "ci/lib.sh: add support for Azure Pipelines".
  • Backed out the Windows support from the commit adding the initial Azure Pipelines support, instead adding it later to help the flow of the patch series.
  • Clarified in "test-date: add a subcommand to measure times in shell scripts"'s commit message that one of the goals is portability.
  • Moved all of those CI type-specific definitions in ci/lib.sh after the set -ex statement (as suggested by Gábor a loooong time ago already).
  • Guarded the Travis-specific way to "publish" trash directories of failed tests via base64-encoded tar files in the log; For Azure Pipelines, these are published as build artifacts instead.

Changes since v3:

  • Fixed the incorrect usage of cut -c that resulted in an empty trace when failed tests were published (except if the very first test case in a test script failed, then it was correct by mistake).
  • Excluded the previous test case's "ok" (or "not ok") line from the trace published with failed tests.
  • Renamed TRAVIS_BRANCH to CI_BRANCH already in the first commit, as we should not override TRAVIS_BRANCH with something that it is not.
  • Rebased onto current master to avoid merge conflicts with the recently-merged sg/stress-test branch (no, Junio, I really trust myself more than you to resolve those merge conflicts).

Changes since v2:

  • Removed left-over debugging code that would skip a good chunk of t0000-init.sh.
  • Fixed the URL of the build badge.
  • Removed a trailing empty line from, and added a missing closing pointy bracket to, ci/mount-fileshare.sh.
  • Moved the "travis: fix skipping tagged releases" commit up to the beginning of the patch series.
  • The commit message of "ci/lib.sh: add support for Azure Pipelines" now mentions explicitly that the Homebrew packages that need to be installed on Travis' macOS agents are already installed on Azure Pipelines'.
  • Some commands were not guarded by || exit 1, i.e. if they would fail, the build would not have failed.
  • We now install gcc-8 for the linux-gcc job.
  • We no longer try to re-run failed tests with verbose log. Instead, we simply use the verbose log to begin with. Tests showed that it had a negligible impact on the overall run time.
  • The test_atexit_handler function was scratched; It would be the right thing to do, but is actually an independent topic (it was only implemented in v2 to accommodate the "re-run with verbose log on failure" trick)
  • We now use a new YAML schema (most notably, "phase" is now known as "job")
  • The Windows job contained PowerShell sections that were indented with 3 spaces instead of 2.
  • The Windows job is now structured better, by separating different concerns into different "tasks" so that it is easier to see what exactly failed (was it the build? was it the test?)
  • The Windows job was split into a job to build Git and 10 parallel jobs to run the test suite with the artifacts built by the first job. This reduces the overall run time from ~1h20 (which was the run time by the Windows job) to ~35 minutes (which is the run time of the linux-gcc job).
  • The JUnit XML is now written using a test helper rather than a fragile and probably not even portable sed call.
  • Since we needed to determine the file size of the verbose log (to cut out individual test cases' log output), we now introduce a test helper to do that, and use it throughout the test suite (where Perl was used before).
  • It would appear that a recent change either in Cygwin or in the Azure VMs causes problems sporadically where the trash directories cannot be removed, but a subsequent rm will succeed. We now simply do that, because it won't harm the common case (where the first rm succeeds already) and because it helps the Windows job succeed pretty reliably.

Changes since v1:

  • Removed a superfluous eval.
  • Added the commit that fixes the Travis PR builds targeting master that just happens to be tagged (see https://travis-ci.org/git/git/jobs/424276413 for an incorrectly-skipped build).
  • The commit messages and the cover letter now reflect the name change from Visual Studio Team Services to Azure DevOps (and in particular, Azure Pipelines for the automated builds).
  • Now we're using test_atexit (which we introduced for that purpose) instead of hard-coding kill_p4d and stop_git_daemon.
  • The build should now also succeed for Pull Requests (where secret variables are not available, for security reasons, and as a consequence the file share cannot be mounted).
  • The shell scripted parts now use proper && chains.

cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: SZEDER Gábor szeder.dev@gmail.com

@dscho dscho changed the title Vsts ci Offer to run in Visual Studio Team Services' CI Aug 31, 2018
@dscho dscho changed the title Offer to run in Visual Studio Team Services' CI Offer to run CI/PR builds in Visual Studio Team Services Sep 1, 2018
@dscho dscho force-pushed the vsts-ci branch 9 times, most recently from c7ee7ea to dc1d890 Compare September 3, 2018 19:16
@dscho
Copy link
Member Author

dscho commented Sep 3, 2018

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 3, 2018

Submitted as pull.31.git.gitgitgadget@gmail.com

@derrickstolee derrickstolee self-requested a review September 4, 2018 13:06
@dscho dscho changed the title Offer to run CI/PR builds in Visual Studio Team Services Offer to run CI/PR builds in Azure Pipelines Sep 14, 2018
@dscho dscho force-pushed the vsts-ci branch 2 times, most recently from edd8784 to b96f5fc Compare September 17, 2018 02:46
@dscho
Copy link
Member Author

dscho commented Oct 15, 2018

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2018

Submitted as pull.31.v2.git.gitgitgadget@gmail.com

@dscho
Copy link
Member Author

dscho commented Oct 15, 2018

/allow jamill

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2018

User jamill already allowed to use GitGitGadget.

Copy link
Member Author

dscho commented Nov 2, 2018

This branch is now known as js/vsts-ci.

dscho added 11 commits January 28, 2019 17:06
This will come in handy when publishing the results of Git's test suite
during an automated Azure DevOps run.

Note: we need to make extra sure that invalid UTF-8 encoding is turned
into valid UTF-8 (using the Replacement Character, \uFFFD) because
t9902's trace contains such invalid byte sequences, and the task in the
Azure Pipeline that uploads the test results would refuse to do anything
if it was asked to parse an .xml file with invalid UTF-8 in it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This patch introduces a conditional arm that defines some environment
variables and a function that displays the URL given the job id (to
identify previous runs for known-good trees).

Because Azure Pipeline's macOS agents already have git-lfs and gettext
installed, we can leave `BREW_INSTALL_PACKAGES` empty (unlike in
Travis' case).

Note: this patch does not introduce an Azure Pipelines definition yet;
That is left for the next patch.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This commit adds an azure-pipelines.yml file which is Azure DevOps'
equivalent to Travis CI's .travis.yml.

The main idea is to replicate the Travis configuration as faithfully as
possible, to make it easy to compare the Azure Pipeline builds to the
Travis ones (spoiler: some parts, especially the macOS jobs, are way
faster in Azure Pileines). Meaning: the number and the order of the jobs
added in this commit faithfully replicates what we have in .travis.yml.

Note: Our .travis.yml configuration has a Windows part that is *not*
replicated in the Azure Pipelines definition. The reason is easy to see:
As Travis cannot support our Windws needs (even with the preliminary
Windows support that was recently added to Travis after waiting for
*years* for that feature, our test suite would simply hit Travis'
timeout every single time).

To make things a bit easier to understand, we refrain from using the
`matrix` feature here because (while it is powerful) it can be a bit
confusing to users who are not familiar with CI setups. Therefore, we
use a separate phase even for similar configurations (such as GCC vs
Clang on Linux, GCC vs Clang on macOS).

Also, we make use of the shiny new feature we just introduced where the
test suite can output JUnit-style .xml files. This information is made
available in a nice UI that allows the viewer to filter by phase and/or
test number, and to see trends such as: number of (failing) tests, time
spent running the test suite, etc. (While this seemingly contradicts the
intention to replicate the Travis configuration as faithfully as
possible, it is just too nice to show off that capability here already.)

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Previously, we did not have robust support for Windows in our CI
definition, simply because Travis cannot accommodate our needs (even
after Travis added experimental Windows support very recently, it takes
longer than Travis' 50 minute timeout to build Git and run the test
suite on Windows). Instead, we used a hack that started a dedicated
Azure Pipeline from Travis and waited for the output, often timing out
(which is quite fragile, as we found out).

With this commit, we finally have first-class support for Windows in our
CI definition (in the Azure Pipelines one, that is).

Due to our reliance on Unix shell scripting in the test suite, combined
with the challenges on executing such scripts on Windows, the Windows
job currently takes a whopping ~1h20m to complete. Which is *far* longer
than the next-longest job takes (linux-gcc, ~35m).

Now, Azure Pipelines's free tier for open source projects (such as Git)
offers up to 10 concurrent jobs for free, meaning that the overall run
time will be dominated by the slowest job(s).

Therefore, it makes sense to start the Windows job first, to minimize
the time the entire build takes from start to end (which is now pretty
safely the run time of the Windows job).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Instead of a shallow fetch followed by a sparse checkout, we are
better off by using a separate, dedicated Pipeline that bundles
the SDK as a build artifact, and then consuming that build artifact
here.

In fact, since this artifact will be used a lot, we spent substantial
time on figuring out a minimal subset of the Git for Windows SDK, just
enough to build and test Git. The result is a size reduction from around
1GB (compressed) to around 55MB (compressed). This also comes with the
change where we now call `usr\bin\bash.exe` directly, as `git-cmd.exe`
is not included in the minimal SDK.

That reduces the time to initialize Git for Windows' SDK from anywhere
between 2m30s-7m to a little over 1m.

Note: in theory, we could also use the DownloadBuildArtifacts@0 task
here. However, restricted permissions that are in effect when building
from forks would let this fail for PR builds, defeating the whole
purpose of the Azure Pipelines support for git.git.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Every once in a while, the Azure Pipeline fails with some semi-random

	error: timer thread did not terminate timely

This error message means that the thread that is used to emulate the
setitimer() function did not terminate within 1,000 milliseconds.

The most likely explanation (and therefore the one we should assume to
be true, according to Occam's Razor) is that the timeout of one second
is simply not enough because we try to run so many tasks in parallel.

So let's give it ten seconds instead of only one. That should be enough.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Just like so many other OSS projects, we now also have a build badge.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
It is a bit ridiculous to spin up a full-blown Perl instance (especially
on Windows, where that means spinning up a full POSIX emulation layer,
AKA the MSYS2 runtime) just to tell how large a given file is.

So let's just use the test-tool to do that job instead.

This command will also be used over the next commits, to allow for
cutting out individual test cases' verbose log from the file generated
via --verbose-log.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The JUnit XML format lends itself to be presented in a powerful UI,
where you can drill down to the information you are interested in very
quickly.

For test failures, this usually means that you want to see the detailed
trace of the failing tests.

With Travis CI, we passed the `--verbose-log` option to get those
traces. However, that seems excessive, as we do not need/use the logs in
almost all of those cases: only when a test fails do we have a way to
include the trace.

So let's do something different when using Azure DevOps: let's run all
the tests with `--quiet` first, and only if a failure is encountered,
try to trace the commands as they are executed.

Of course, we cannot turn on `--verbose-log` after the fact. So let's
just re-run the test with all the same options, adding `--verbose-log`.
And then munging the output file into the JUnit XML on the fly.

Note: there is an off chance that re-running the test in verbose mode
"fixes" the failures (and this does happen from time to time!). That is
a possibility we should be able to live with. Ideally, we would label
this as "Passed upon rerun", and Azure Pipelines even know about that
outcome, but it is not available when using the JUnit XML format for
now:
https://github.com/Microsoft/azure-pipelines-agent/blob/master/src/Agent.Worker/TestResults/JunitResultReader.cs

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
It seems that every once in a while in the Git for Windows SDK, there
are some transient file locking issues preventing the test clean up to
delete the trash directory. Let's be gentle and try again five seconds
later, and only error out if it still fails the second time.

This change helps Windows, and does not hurt any other platform
(normally, it is highly unlikely that said deletion fails, and if it
does, normally it will fail again even 5 seconds later).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
We really need to be able to find the test helpers... Really. This
change was forgotten when we moved the test helpers into t/helper/

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@gitgitgadget
Copy link

gitgitgadget bot commented Jan 28, 2019

This patch series was integrated into pu via git@8cb63d7.

dscho added 4 commits January 29, 2019 11:57
When building Git with RUNTIME_PREFIX and starting a test helper from
t/helper/, it fails to detect a system prefix. The reason is that the
RUNTIME_PREFIX feature wants to use the location of the Git executable
to determine where the support files can be found, e.g. system-wide Git
config or the translations. This does not make any sense for the test
helpers, though, as they are distinctly not in a directory structure
resembling the final installation location of Git.

That is the reason why the test helpers rely on environment variables to
indicate the location of the needed support files, e.g.
GIT_TEXTDOMAINDIR. If this information is missing, the output will
contain warnings like this one:

	RUNTIME_PREFIX requested, but prefix computation failed. [...]

In t0061, we did not expect that to happen, and it actually does not
happen in the regular case, because bin-wrappers/test-tool specifically
sets GIT_TEXTDOMAINDIR (and as a consequence, nothing in test-tool needs
to know anything about any runtime prefix).

However, with --with-dashes, bin-wrappers/test-tool is no longer called,
but t/helper/test-tool is called directly instead.

So let's just ignore the RUNTIME_PREFIX warning.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This speeds up the tests by a bit on Windows, where running Unix shell
scripts (and spawning processes) is not exactly a cheap operation.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
As Unix shell scripting comes at a hefty price on Windows, we have to
see where we can save some time to run the test suite.

Let's skip the chain linting and the bin-wrappers/ redirection on
Windows; this seems to shave of anywhere between 10-30% from the overall
runtime.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The fact that Git's test suite is implemented in Unix shell script that
is as portable as we can muster, combined with the fact that Unix shell
scripting is foreign to Windows (and therefore has to be emulated),
results in pretty abysmal speed of the test suite on that platform, for
pretty much no other reason than that language choice.

For comparison: while the Linux build & test is typically done within
about 8 minutes, the Windows build & test typically lasts about 80
minutes in Azure Pipelines.

To help with that, let's use the Azure Pipeline feature where you can
parallelize jobs, make jobs depend on each other, and pass artifacts
between them.

The tests are distributed using the following heuristic: listing all
test scripts ordered by size in descending order (as a cheap way to
estimate the overall run time), every Nth script is run (where N is the
total number of parallel jobs), starting at the index corresponding to
the parallel job. This slicing is performed by a new function that is
added to the `test-tool`.

To optimize the overall runtime of the entire Pipeline, we need to move
the Windows jobs to the beginning (otherwise there would be a very
decent chance for the Pipeline to be run only the Windows build, while
all the parallel Windows test jobs wait for this single one).

We use Azure Pipelines Artifacts for both the minimal Git for Windows
SDK as well as the built executables, as deduplication and caching close
to the agents makes that really fast. For comparison: while downloading
and unpacking the minimal Git for Windows SDK via PowerShell takes only
one minute (down from anywhere between 2.5 to 7 when using a shallow
clone), uploading it as Pipeline Artifact takes less than 30s and
downloading and unpacking less than 20s (sometimes even as little as
only twelve seconds).

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho
Copy link
Member Author

dscho commented Jan 29, 2019

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 29, 2019

Submitted as pull.31.v6.git.gitgitgadget@gmail.com

@dscho
Copy link
Member Author

dscho commented Feb 6, 2019

Made it to next via git@7297a73!

@dscho dscho added the next label Feb 6, 2019
@dscho
Copy link
Member Author

dscho commented Feb 12, 2019

Made it into master via 57cbc53

@dscho dscho closed this Feb 12, 2019
@dscho dscho added the master label Feb 12, 2019
@dscho dscho deleted the vsts-ci branch February 24, 2019 14:03
@@ -139,6 +139,9 @@ do
verbose_log=t
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> It seems that every once in a while in the Git for Windows SDK, there
> are some transient file locking issues preventing the test clean up to
> delete the trash directory. Let's be gentle and try again five seconds
> later, and only error out if it still fails the second time.
>
> This change helps Windows, and does not hurt any other platform
> (normally, it is highly unlikely that said deletion fails, and if it
> does, normally it will fail again even 5 seconds later).
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  t/test-lib.sh | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index f31a1c8f79..9c0ca5effb 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -1104,7 +1104,11 @@ test_done () {
>  			error "Tests passed but trash directory already removed before test cleanup; aborting"
>  
>  			cd "$TRASH_DIRECTORY/.." &&
> -			rm -fr "$TRASH_DIRECTORY" ||
> +			rm -fr "$TRASH_DIRECTORY" || {
> +				# try again in a bit
> +				sleep 5;
> +				rm -fr "$TRASH_DIRECTORY"
> +			} ||
>  			error "Tests passed but test cleanup failed; aborting"
>  		fi
>  		test_at_end_hook_

I saw this sleep while reading some test-lib.sh code, doesn't this break
df4c0d1a79 (test-lib: abort when can't remove trash directory,
2017-04-20) for non-Windows platforms?

Your CL for v3 suggests this was only encountered in Azure VMs:
https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/

Aside from this obscure issue, wouldn't it make more sense to have some
optional "I'm under CI" flag to skip the teardown one test at a time as
we're probably about to shut off the transitory VM soon?

I skip some tests, but the test suite creates ~950MB of trash for
me. Maybe cheaper for some to just keep that around and have it all
removed at the end.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--8323328-464419750-1616163626=:57
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi =C3=86var,

thanks for digging out this old thread, I really could have done never
thinking about it again!

On Wed, 10 Mar 2021, =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote:

>
> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > It seems that every once in a while in the Git for Windows SDK, there
> > are some transient file locking issues preventing the test clean up to
> > delete the trash directory. Let's be gentle and try again five seconds
> > later, and only error out if it still fails the second time.
> >
> > This change helps Windows, and does not hurt any other platform
> > (normally, it is highly unlikely that said deletion fails, and if it
> > does, normally it will fail again even 5 seconds later).
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> >  t/test-lib.sh | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > index f31a1c8f79..9c0ca5effb 100644
> > --- a/t/test-lib.sh
> > +++ b/t/test-lib.sh
> > @@ -1104,7 +1104,11 @@ test_done () {
> >  			error "Tests passed but trash directory already removed before tes=
t cleanup; aborting"
> >
> >  			cd "$TRASH_DIRECTORY/.." &&
> > -			rm -fr "$TRASH_DIRECTORY" ||
> > +			rm -fr "$TRASH_DIRECTORY" || {
> > +				# try again in a bit
> > +				sleep 5;
> > +				rm -fr "$TRASH_DIRECTORY"
> > +			} ||
> >  			error "Tests passed but test cleanup failed; aborting"
> >  		fi
> >  		test_at_end_hook_
>
> I saw this sleep while reading some test-lib.sh code, doesn't this break
> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> 2017-04-20) for non-Windows platforms?

It does not really break it, it just delays the inevitable failure.

> Your CL for v3 suggests this was only encountered in Azure VMs:
> https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/

If by "CL" you refer to the cover letter, then I might have made it sound
as if it was only encountered in the Azure Pipelines agents. I vaguely
seem to remember seeing something like this quite often on my personal
machine, too, though. Most likely Microsoft Defender going a little wild.

> Aside from this obscure issue, wouldn't it make more sense to have some
> optional "I'm under CI" flag to skip the teardown one test at a time as
> we're probably about to shut off the transitory VM soon?

No, I'm not under CI, and I did encounter these issues. And they abruptly
stopped with the patch you apparently still want to discuss ;-)

> I skip some tests, but the test suite creates ~950MB of trash for
> me. Maybe cheaper for some to just keep that around and have it all
> removed at the end.

I don't understand this statement. Or was there a question in it that
you'd like me to answer?

Ciao,
Johannes

--8323328-464419750-1616163626=:57--

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Fri, Mar 19 2021, Johannes Schindelin wrote:

> Hi Ævar,
>
> thanks for digging out this old thread, I really could have done never
> thinking about it again!

I didn't think I'd dug up something so traumatic :)

> On Wed, 10 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>
>>
>> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>>
>> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
>> >
>> > It seems that every once in a while in the Git for Windows SDK, there
>> > are some transient file locking issues preventing the test clean up to
>> > delete the trash directory. Let's be gentle and try again five seconds
>> > later, and only error out if it still fails the second time.
>> >
>> > This change helps Windows, and does not hurt any other platform
>> > (normally, it is highly unlikely that said deletion fails, and if it
>> > does, normally it will fail again even 5 seconds later).
>> >
>> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > ---
>> >  t/test-lib.sh | 6 +++++-
>> >  1 file changed, 5 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/t/test-lib.sh b/t/test-lib.sh
>> > index f31a1c8f79..9c0ca5effb 100644
>> > --- a/t/test-lib.sh
>> > +++ b/t/test-lib.sh
>> > @@ -1104,7 +1104,11 @@ test_done () {
>> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
>> >
>> >  			cd "$TRASH_DIRECTORY/.." &&
>> > -			rm -fr "$TRASH_DIRECTORY" ||
>> > +			rm -fr "$TRASH_DIRECTORY" || {
>> > +				# try again in a bit
>> > +				sleep 5;
>> > +				rm -fr "$TRASH_DIRECTORY"
>> > +			} ||
>> >  			error "Tests passed but test cleanup failed; aborting"
>> >  		fi
>> >  		test_at_end_hook_
>>
>> I saw this sleep while reading some test-lib.sh code, doesn't this break
>> df4c0d1a79 (test-lib: abort when can't remove trash directory,
>> 2017-04-20) for non-Windows platforms?
>
> It does not really break it, it just delays the inevitable failure.
>
>> Your CL for v3 suggests this was only encountered in Azure VMs:
>> https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/
>
> If by "CL" you refer to the cover letter, then I might have made it sound
> as if it was only encountered in the Azure Pipelines agents. I vaguely

Yes, the cover letter.

> seem to remember seeing something like this quite often on my personal
> machine, too, though. Most likely Microsoft Defender going a little wild.
>
>> Aside from this obscure issue, wouldn't it make more sense to have some
>> optional "I'm under CI" flag to skip the teardown one test at a time as
>> we're probably about to shut off the transitory VM soon?
>
> No, I'm not under CI, and I did encounter these issues. And they abruptly
> stopped with the patch you apparently still want to discuss ;-)
>
>> I skip some tests, but the test suite creates ~950MB of trash for
>> me. Maybe cheaper for some to just keep that around and have it all
>> removed at the end.
>
> I don't understand this statement. Or was there a question in it that
> you'd like me to answer?

If the fix was purely for Azure's CI setup I was suggesting that a
better solution might be to keep the trash around until the end. But
that's not the only setup as you note here, so let's discard that
suggestion.

In any case, your patch clearly undoes whatever canary for gc issues
df4c0d1a792 was trying to put into the test-lib, but didn't say so in
its commit message.

So I figured it was something that was missed at the time, and that I
should send a quick E-Mail to both authors to see if anyone cared, maybe
nobody does.

It's just something I ran into while reviewing test-lib.sh for some
unrelated changes I was making...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> In any case, your patch clearly undoes whatever canary for gc issues
> df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> its commit message.
>
> So I figured it was something that was missed at the time, and that I
> should send a quick E-Mail to both authors to see if anyone cared, maybe
> nobody does.
>
> It's just something I ran into while reviewing test-lib.sh for some
> unrelated changes I was making...

Good eyes.

I am not quite sure if we are better off catching a transitory
failure, which could go away when retried in 5 seconds, to remove
the test directory (in which case, the change in question is a clear
regression from what df4c0d1a (test-lib: abort when can't remove
trash directory, 2017-04-20) intended to catch), or we only want to
catch permanent failures (in which case the patch in question is a
clear improvement).

In either way, the decision and the rationale behind it should be in
the log message.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Johannes Schindelin wrote (reply to this):

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--8323328-1544568225-1616587311=:50
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi =C3=86var,

On Sat, 20 Mar 2021, =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote:

> On Fri, Mar 19 2021, Johannes Schindelin wrote:
>
> >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> >>
> >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> >
> >> > It seems that every once in a while in the Git for Windows SDK, the=
re
> >> > are some transient file locking issues preventing the test clean up=
 to
> >> > delete the trash directory. Let's be gentle and try again five seco=
nds
> >> > later, and only error out if it still fails the second time.
> >> >
> >> > This change helps Windows, and does not hurt any other platform
> >> > (normally, it is highly unlikely that said deletion fails, and if i=
t
> >> > does, normally it will fail again even 5 seconds later).
> >> >
> >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > ---
> >> >  t/test-lib.sh | 6 +++++-
> >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> >> > index f31a1c8f79..9c0ca5effb 100644
> >> > --- a/t/test-lib.sh
> >> > +++ b/t/test-lib.sh
> >> > @@ -1104,7 +1104,11 @@ test_done () {
> >> >  			error "Tests passed but trash directory already removed before =
test cleanup; aborting"
> >> >
> >> >  			cd "$TRASH_DIRECTORY/.." &&
> >> > -			rm -fr "$TRASH_DIRECTORY" ||
> >> > +			rm -fr "$TRASH_DIRECTORY" || {
> >> > +				# try again in a bit
> >> > +				sleep 5;
> >> > +				rm -fr "$TRASH_DIRECTORY"
> >> > +			} ||
> >> >  			error "Tests passed but test cleanup failed; aborting"
> >> >  		fi
> >> >  		test_at_end_hook_
> >>
> >> I saw this sleep while reading some test-lib.sh code, doesn't this br=
eak
> >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> >> 2017-04-20) for non-Windows platforms?
> >
> > It does not really break it, it just delays the inevitable failure.

I still think this is the best answer to this (implicit) question:

> In any case, your patch clearly undoes whatever canary for gc issues
> df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> its commit message.

I was not _really_ paying attention to that commit when I implemented the
work-around you mentioned above. At the same time I think it does _not_
undo the canary. If the trash directory cannot be removed via `rm -fr`,
and if that is an indicator for something fishy going on, chances are that
the second `rm -fr` a couple seconds later will _also_ fail, and we still
get that error message.

The only reason why the second `rm` should succeed, at least that I can
think of, is that something on Windows blocked those files from being
deleted, and it is no longer blocking after a couple seconds, and that
usually means that an anti-malware scanned those files.

Ciao,
Dscho

--8323328-1544568225-1616587311=:50--

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
> Hi Ævar,
> 
> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
> 
> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
> >
> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> > >>
> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >> >
> > >> > It seems that every once in a while in the Git for Windows SDK, there
> > >> > are some transient file locking issues preventing the test clean up to
> > >> > delete the trash directory. Let's be gentle and try again five seconds
> > >> > later, and only error out if it still fails the second time.
> > >> >
> > >> > This change helps Windows, and does not hurt any other platform
> > >> > (normally, it is highly unlikely that said deletion fails, and if it
> > >> > does, normally it will fail again even 5 seconds later).
> > >> >
> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >> > ---
> > >> >  t/test-lib.sh | 6 +++++-
> > >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >> >
> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > >> > index f31a1c8f79..9c0ca5effb 100644
> > >> > --- a/t/test-lib.sh
> > >> > +++ b/t/test-lib.sh
> > >> > @@ -1104,7 +1104,11 @@ test_done () {
> > >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
> > >> >
> > >> >  			cd "$TRASH_DIRECTORY/.." &&
> > >> > -			rm -fr "$TRASH_DIRECTORY" ||
> > >> > +			rm -fr "$TRASH_DIRECTORY" || {
> > >> > +				# try again in a bit
> > >> > +				sleep 5;
> > >> > +				rm -fr "$TRASH_DIRECTORY"
> > >> > +			} ||
> > >> >  			error "Tests passed but test cleanup failed; aborting"
> > >> >  		fi
> > >> >  		test_at_end_hook_
> > >>
> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> > >> 2017-04-20) for non-Windows platforms?
> > >
> > > It does not really break it, it just delays the inevitable failure.
> 
> I still think this is the best answer to this (implicit) question:
> 
> > In any case, your patch clearly undoes whatever canary for gc issues
> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> > its commit message.
> 
> I was not _really_ paying attention to that commit when I implemented the
> work-around you mentioned above. At the same time I think it does _not_
> undo the canary. If the trash directory cannot be removed via `rm -fr`,
> and if that is an indicator for something fishy going on, chances are that
> the second `rm -fr` a couple seconds later will _also_ fail, and we still
> get that error message.
> 
> The only reason why the second `rm` should succeed, at least that I can
> think of, is that something on Windows blocked those files from being
> deleted, and it is no longer blocking after a couple seconds, and that
> usually means that an anti-malware scanned those files.

Both commits referenced in df4c0d1a79's log message fixed races
between 'test_done's cleanup and a still running background 'git gc',
and df4c0d1a79 was meant to draw our attention to similar issues in
the future.  And it did:

  https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
  
So no, the failure is not inevitable, there are other reasons why the
second 'rm' might still succeed after the first failed, even just a
fraction of a second later.  And yes, that 'sleep' added in the patch
above did unquestionably break that canary,

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Wed, Mar 24 2021, SZEDER Gábor wrote:

> On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
>> Hi Ævar,
>> 
>> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>> 
>> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
>> >
>> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>> > >>
>> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > >> >
>> > >> > It seems that every once in a while in the Git for Windows SDK, there
>> > >> > are some transient file locking issues preventing the test clean up to
>> > >> > delete the trash directory. Let's be gentle and try again five seconds
>> > >> > later, and only error out if it still fails the second time.
>> > >> >
>> > >> > This change helps Windows, and does not hurt any other platform
>> > >> > (normally, it is highly unlikely that said deletion fails, and if it
>> > >> > does, normally it will fail again even 5 seconds later).
>> > >> >
>> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > >> > ---
>> > >> >  t/test-lib.sh | 6 +++++-
>> > >> >  1 file changed, 5 insertions(+), 1 deletion(-)
>> > >> >
>> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
>> > >> > index f31a1c8f79..9c0ca5effb 100644
>> > >> > --- a/t/test-lib.sh
>> > >> > +++ b/t/test-lib.sh
>> > >> > @@ -1104,7 +1104,11 @@ test_done () {
>> > >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
>> > >> >
>> > >> >  			cd "$TRASH_DIRECTORY/.." &&
>> > >> > -			rm -fr "$TRASH_DIRECTORY" ||
>> > >> > +			rm -fr "$TRASH_DIRECTORY" || {
>> > >> > +				# try again in a bit
>> > >> > +				sleep 5;
>> > >> > +				rm -fr "$TRASH_DIRECTORY"
>> > >> > +			} ||
>> > >> >  			error "Tests passed but test cleanup failed; aborting"
>> > >> >  		fi
>> > >> >  		test_at_end_hook_
>> > >>
>> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
>> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
>> > >> 2017-04-20) for non-Windows platforms?
>> > >
>> > > It does not really break it, it just delays the inevitable failure.
>> 
>> I still think this is the best answer to this (implicit) question:
>> 
>> > In any case, your patch clearly undoes whatever canary for gc issues
>> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
>> > its commit message.
>> 
>> I was not _really_ paying attention to that commit when I implemented the
>> work-around you mentioned above. At the same time I think it does _not_
>> undo the canary. If the trash directory cannot be removed via `rm -fr`,
>> and if that is an indicator for something fishy going on, chances are that
>> the second `rm -fr` a couple seconds later will _also_ fail, and we still
>> get that error message.
>> 
>> The only reason why the second `rm` should succeed, at least that I can
>> think of, is that something on Windows blocked those files from being
>> deleted, and it is no longer blocking after a couple seconds, and that
>> usually means that an anti-malware scanned those files.
>
> Both commits referenced in df4c0d1a79's log message fixed races
> between 'test_done's cleanup and a still running background 'git gc',
> and df4c0d1a79 was meant to draw our attention to similar issues in
> the future.  And it did:
>
>   https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
>   
> So no, the failure is not inevitable, there are other reasons why the
> second 'rm' might still succeed after the first failed, even just a
> fraction of a second later.  And yes, that 'sleep' added in the patch
> above did unquestionably break that canary,

Having read that thread now I agree, but I also left with a "who cares?"
and "so let's keep the sleep then?".

I.e. is this a problem that any of the software we're maintaining is
going to care about in the wild, it's not like people are expecting gc,
repack, fast-import etc. to behave well in the face of rm -rfing the
directory they're operating on.

So it seems like just an issue that crops up because of how our test
suite manages and removes per-test trash directories. So it seems better
to:

 1. Just keep that "sleep a bit" and retry hack

 2. Maybe on some/most platforms we can use cgroups or whatever passes
    for a reliable "I started a process tree starting at this PID, kill
    -9 the whole thing please" before cleanup these days.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Wed, Mar 24, 2021 at 11:57:46PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Mar 24 2021, SZEDER Gábor wrote:
> 
> > On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
> >> Hi Ævar,
> >> 
> >> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
> >> 
> >> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
> >> >
> >> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> >> > >>
> >> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > >> >
> >> > >> > It seems that every once in a while in the Git for Windows SDK, there
> >> > >> > are some transient file locking issues preventing the test clean up to
> >> > >> > delete the trash directory. Let's be gentle and try again five seconds
> >> > >> > later, and only error out if it still fails the second time.
> >> > >> >
> >> > >> > This change helps Windows, and does not hurt any other platform
> >> > >> > (normally, it is highly unlikely that said deletion fails, and if it
> >> > >> > does, normally it will fail again even 5 seconds later).
> >> > >> >
> >> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > >> > ---
> >> > >> >  t/test-lib.sh | 6 +++++-
> >> > >> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >> > >> >
> >> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> >> > >> > index f31a1c8f79..9c0ca5effb 100644
> >> > >> > --- a/t/test-lib.sh
> >> > >> > +++ b/t/test-lib.sh
> >> > >> > @@ -1104,7 +1104,11 @@ test_done () {
> >> > >> >  			error "Tests passed but trash directory already removed before test cleanup; aborting"
> >> > >> >
> >> > >> >  			cd "$TRASH_DIRECTORY/.." &&
> >> > >> > -			rm -fr "$TRASH_DIRECTORY" ||
> >> > >> > +			rm -fr "$TRASH_DIRECTORY" || {
> >> > >> > +				# try again in a bit
> >> > >> > +				sleep 5;
> >> > >> > +				rm -fr "$TRASH_DIRECTORY"
> >> > >> > +			} ||
> >> > >> >  			error "Tests passed but test cleanup failed; aborting"
> >> > >> >  		fi
> >> > >> >  		test_at_end_hook_
> >> > >>
> >> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> >> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> >> > >> 2017-04-20) for non-Windows platforms?
> >> > >
> >> > > It does not really break it, it just delays the inevitable failure.
> >> 
> >> I still think this is the best answer to this (implicit) question:
> >> 
> >> > In any case, your patch clearly undoes whatever canary for gc issues
> >> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> >> > its commit message.
> >> 
> >> I was not _really_ paying attention to that commit when I implemented the
> >> work-around you mentioned above. At the same time I think it does _not_
> >> undo the canary. If the trash directory cannot be removed via `rm -fr`,
> >> and if that is an indicator for something fishy going on, chances are that
> >> the second `rm -fr` a couple seconds later will _also_ fail, and we still
> >> get that error message.
> >> 
> >> The only reason why the second `rm` should succeed, at least that I can
> >> think of, is that something on Windows blocked those files from being
> >> deleted, and it is no longer blocking after a couple seconds, and that
> >> usually means that an anti-malware scanned those files.
> >
> > Both commits referenced in df4c0d1a79's log message fixed races
> > between 'test_done's cleanup and a still running background 'git gc',
> > and df4c0d1a79 was meant to draw our attention to similar issues in
> > the future.  And it did:
> >
> >   https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
> >   
> > So no, the failure is not inevitable, there are other reasons why the
> > second 'rm' might still succeed after the first failed, even just a
> > fraction of a second later.  And yes, that 'sleep' added in the patch
> > above did unquestionably break that canary,
> 
> Having read that thread now I agree, but I also left with a "who cares?"
> and "so let's keep the sleep then?".
> 
> I.e. is this a problem that any of the software we're maintaining is
> going to care about in the wild, it's not like people are expecting gc,
> repack, fast-import etc. to behave well in the face of rm -rfing the
> directory they're operating on.
> 
> So it seems like just an issue that crops up because of how our test
> suite manages and removes per-test trash directories.

Not at all.  The real problem is that some stray background git
process is *still* actively writing to the test repository when the
test script is already supposed to be finished.

> So it seems better
> to:
> 
>  1. Just keep that "sleep a bit" and retry hack
> 
>  2. Maybe on some/most platforms we can use cgroups or whatever passes
>     for a reliable "I started a process tree starting at this PID, kill
>     -9 the whole thing please" before cleanup these days.

What really seems better:

  3. Keep that "sleep a bit" hack only on platforms that can't delete
     opened files.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 10, 2021

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 24, 2021

User SZEDER Gábor <szeder.dev@gmail.com> has been added to the cc: list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant