-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Offer to run CI/PR builds in Azure Pipelines #31
Conversation
c7ee7ea
to
dc1d890
Compare
/submit |
Submitted as pull.31.git.gitgitgadget@gmail.com |
edd8784
to
b96f5fc
Compare
/submit |
Submitted as pull.31.v2.git.gitgitgadget@gmail.com |
/allow jamill |
User jamill already allowed to use GitGitGadget. |
This branch is now known as |
This will come in handy when publishing the results of Git's test suite during an automated Azure DevOps run. Note: we need to make extra sure that invalid UTF-8 encoding is turned into valid UTF-8 (using the Replacement Character, \uFFFD) because t9902's trace contains such invalid byte sequences, and the task in the Azure Pipeline that uploads the test results would refuse to do anything if it was asked to parse an .xml file with invalid UTF-8 in it. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This patch introduces a conditional arm that defines some environment variables and a function that displays the URL given the job id (to identify previous runs for known-good trees). Because Azure Pipeline's macOS agents already have git-lfs and gettext installed, we can leave `BREW_INSTALL_PACKAGES` empty (unlike in Travis' case). Note: this patch does not introduce an Azure Pipelines definition yet; That is left for the next patch. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This commit adds an azure-pipelines.yml file which is Azure DevOps' equivalent to Travis CI's .travis.yml. The main idea is to replicate the Travis configuration as faithfully as possible, to make it easy to compare the Azure Pipeline builds to the Travis ones (spoiler: some parts, especially the macOS jobs, are way faster in Azure Pileines). Meaning: the number and the order of the jobs added in this commit faithfully replicates what we have in .travis.yml. Note: Our .travis.yml configuration has a Windows part that is *not* replicated in the Azure Pipelines definition. The reason is easy to see: As Travis cannot support our Windws needs (even with the preliminary Windows support that was recently added to Travis after waiting for *years* for that feature, our test suite would simply hit Travis' timeout every single time). To make things a bit easier to understand, we refrain from using the `matrix` feature here because (while it is powerful) it can be a bit confusing to users who are not familiar with CI setups. Therefore, we use a separate phase even for similar configurations (such as GCC vs Clang on Linux, GCC vs Clang on macOS). Also, we make use of the shiny new feature we just introduced where the test suite can output JUnit-style .xml files. This information is made available in a nice UI that allows the viewer to filter by phase and/or test number, and to see trends such as: number of (failing) tests, time spent running the test suite, etc. (While this seemingly contradicts the intention to replicate the Travis configuration as faithfully as possible, it is just too nice to show off that capability here already.) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Previously, we did not have robust support for Windows in our CI definition, simply because Travis cannot accommodate our needs (even after Travis added experimental Windows support very recently, it takes longer than Travis' 50 minute timeout to build Git and run the test suite on Windows). Instead, we used a hack that started a dedicated Azure Pipeline from Travis and waited for the output, often timing out (which is quite fragile, as we found out). With this commit, we finally have first-class support for Windows in our CI definition (in the Azure Pipelines one, that is). Due to our reliance on Unix shell scripting in the test suite, combined with the challenges on executing such scripts on Windows, the Windows job currently takes a whopping ~1h20m to complete. Which is *far* longer than the next-longest job takes (linux-gcc, ~35m). Now, Azure Pipelines's free tier for open source projects (such as Git) offers up to 10 concurrent jobs for free, meaning that the overall run time will be dominated by the slowest job(s). Therefore, it makes sense to start the Windows job first, to minimize the time the entire build takes from start to end (which is now pretty safely the run time of the Windows job). Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Instead of a shallow fetch followed by a sparse checkout, we are better off by using a separate, dedicated Pipeline that bundles the SDK as a build artifact, and then consuming that build artifact here. In fact, since this artifact will be used a lot, we spent substantial time on figuring out a minimal subset of the Git for Windows SDK, just enough to build and test Git. The result is a size reduction from around 1GB (compressed) to around 55MB (compressed). This also comes with the change where we now call `usr\bin\bash.exe` directly, as `git-cmd.exe` is not included in the minimal SDK. That reduces the time to initialize Git for Windows' SDK from anywhere between 2m30s-7m to a little over 1m. Note: in theory, we could also use the DownloadBuildArtifacts@0 task here. However, restricted permissions that are in effect when building from forks would let this fail for PR builds, defeating the whole purpose of the Azure Pipelines support for git.git. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Every once in a while, the Azure Pipeline fails with some semi-random error: timer thread did not terminate timely This error message means that the thread that is used to emulate the setitimer() function did not terminate within 1,000 milliseconds. The most likely explanation (and therefore the one we should assume to be true, according to Occam's Razor) is that the timeout of one second is simply not enough because we try to run so many tasks in parallel. So let's give it ten seconds instead of only one. That should be enough. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Just like so many other OSS projects, we now also have a build badge. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
It is a bit ridiculous to spin up a full-blown Perl instance (especially on Windows, where that means spinning up a full POSIX emulation layer, AKA the MSYS2 runtime) just to tell how large a given file is. So let's just use the test-tool to do that job instead. This command will also be used over the next commits, to allow for cutting out individual test cases' verbose log from the file generated via --verbose-log. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The JUnit XML format lends itself to be presented in a powerful UI, where you can drill down to the information you are interested in very quickly. For test failures, this usually means that you want to see the detailed trace of the failing tests. With Travis CI, we passed the `--verbose-log` option to get those traces. However, that seems excessive, as we do not need/use the logs in almost all of those cases: only when a test fails do we have a way to include the trace. So let's do something different when using Azure DevOps: let's run all the tests with `--quiet` first, and only if a failure is encountered, try to trace the commands as they are executed. Of course, we cannot turn on `--verbose-log` after the fact. So let's just re-run the test with all the same options, adding `--verbose-log`. And then munging the output file into the JUnit XML on the fly. Note: there is an off chance that re-running the test in verbose mode "fixes" the failures (and this does happen from time to time!). That is a possibility we should be able to live with. Ideally, we would label this as "Passed upon rerun", and Azure Pipelines even know about that outcome, but it is not available when using the JUnit XML format for now: https://github.com/Microsoft/azure-pipelines-agent/blob/master/src/Agent.Worker/TestResults/JunitResultReader.cs Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
It seems that every once in a while in the Git for Windows SDK, there are some transient file locking issues preventing the test clean up to delete the trash directory. Let's be gentle and try again five seconds later, and only error out if it still fails the second time. This change helps Windows, and does not hurt any other platform (normally, it is highly unlikely that said deletion fails, and if it does, normally it will fail again even 5 seconds later). Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
We really need to be able to find the test helpers... Really. This change was forgotten when we moved the test helpers into t/helper/ Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This patch series was integrated into pu via git@8cb63d7. |
When building Git with RUNTIME_PREFIX and starting a test helper from t/helper/, it fails to detect a system prefix. The reason is that the RUNTIME_PREFIX feature wants to use the location of the Git executable to determine where the support files can be found, e.g. system-wide Git config or the translations. This does not make any sense for the test helpers, though, as they are distinctly not in a directory structure resembling the final installation location of Git. That is the reason why the test helpers rely on environment variables to indicate the location of the needed support files, e.g. GIT_TEXTDOMAINDIR. If this information is missing, the output will contain warnings like this one: RUNTIME_PREFIX requested, but prefix computation failed. [...] In t0061, we did not expect that to happen, and it actually does not happen in the regular case, because bin-wrappers/test-tool specifically sets GIT_TEXTDOMAINDIR (and as a consequence, nothing in test-tool needs to know anything about any runtime prefix). However, with --with-dashes, bin-wrappers/test-tool is no longer called, but t/helper/test-tool is called directly instead. So let's just ignore the RUNTIME_PREFIX warning. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This speeds up the tests by a bit on Windows, where running Unix shell scripts (and spawning processes) is not exactly a cheap operation. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
As Unix shell scripting comes at a hefty price on Windows, we have to see where we can save some time to run the test suite. Let's skip the chain linting and the bin-wrappers/ redirection on Windows; this seems to shave of anywhere between 10-30% from the overall runtime. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
The fact that Git's test suite is implemented in Unix shell script that is as portable as we can muster, combined with the fact that Unix shell scripting is foreign to Windows (and therefore has to be emulated), results in pretty abysmal speed of the test suite on that platform, for pretty much no other reason than that language choice. For comparison: while the Linux build & test is typically done within about 8 minutes, the Windows build & test typically lasts about 80 minutes in Azure Pipelines. To help with that, let's use the Azure Pipeline feature where you can parallelize jobs, make jobs depend on each other, and pass artifacts between them. The tests are distributed using the following heuristic: listing all test scripts ordered by size in descending order (as a cheap way to estimate the overall run time), every Nth script is run (where N is the total number of parallel jobs), starting at the index corresponding to the parallel job. This slicing is performed by a new function that is added to the `test-tool`. To optimize the overall runtime of the entire Pipeline, we need to move the Windows jobs to the beginning (otherwise there would be a very decent chance for the Pipeline to be run only the Windows build, while all the parallel Windows test jobs wait for this single one). We use Azure Pipelines Artifacts for both the minimal Git for Windows SDK as well as the built executables, as deduplication and caching close to the agents makes that really fast. For comparison: while downloading and unpacking the minimal Git for Windows SDK via PowerShell takes only one minute (down from anywhere between 2.5 to 7 when using a shallow clone), uploading it as Pipeline Artifact takes less than 30s and downloading and unpacking less than 20s (sometimes even as little as only twelve seconds). Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
/submit |
Submitted as pull.31.v6.git.gitgitgadget@gmail.com |
Made it to |
Made it into |
@@ -139,6 +139,9 @@ do | |||
verbose_log=t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> It seems that every once in a while in the Git for Windows SDK, there
> are some transient file locking issues preventing the test clean up to
> delete the trash directory. Let's be gentle and try again five seconds
> later, and only error out if it still fails the second time.
>
> This change helps Windows, and does not hurt any other platform
> (normally, it is highly unlikely that said deletion fails, and if it
> does, normally it will fail again even 5 seconds later).
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
> t/test-lib.sh | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index f31a1c8f79..9c0ca5effb 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -1104,7 +1104,11 @@ test_done () {
> error "Tests passed but trash directory already removed before test cleanup; aborting"
>
> cd "$TRASH_DIRECTORY/.." &&
> - rm -fr "$TRASH_DIRECTORY" ||
> + rm -fr "$TRASH_DIRECTORY" || {
> + # try again in a bit
> + sleep 5;
> + rm -fr "$TRASH_DIRECTORY"
> + } ||
> error "Tests passed but test cleanup failed; aborting"
> fi
> test_at_end_hook_
I saw this sleep while reading some test-lib.sh code, doesn't this break
df4c0d1a79 (test-lib: abort when can't remove trash directory,
2017-04-20) for non-Windows platforms?
Your CL for v3 suggests this was only encountered in Azure VMs:
https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/
Aside from this obscure issue, wouldn't it make more sense to have some
optional "I'm under CI" flag to skip the teardown one test at a time as
we're probably about to shut off the transitory VM soon?
I skip some tests, but the test suite creates ~950MB of trash for
me. Maybe cheaper for some to just keep that around and have it all
removed at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Johannes Schindelin wrote (reply to this):
This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
--8323328-464419750-1616163626=:57
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Hi =C3=86var,
thanks for digging out this old thread, I really could have done never
thinking about it again!
On Wed, 10 Mar 2021, =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote:
>
> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >
> > It seems that every once in a while in the Git for Windows SDK, there
> > are some transient file locking issues preventing the test clean up to
> > delete the trash directory. Let's be gentle and try again five seconds
> > later, and only error out if it still fails the second time.
> >
> > This change helps Windows, and does not hurt any other platform
> > (normally, it is highly unlikely that said deletion fails, and if it
> > does, normally it will fail again even 5 seconds later).
> >
> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > ---
> > t/test-lib.sh | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > index f31a1c8f79..9c0ca5effb 100644
> > --- a/t/test-lib.sh
> > +++ b/t/test-lib.sh
> > @@ -1104,7 +1104,11 @@ test_done () {
> > error "Tests passed but trash directory already removed before tes=
t cleanup; aborting"
> >
> > cd "$TRASH_DIRECTORY/.." &&
> > - rm -fr "$TRASH_DIRECTORY" ||
> > + rm -fr "$TRASH_DIRECTORY" || {
> > + # try again in a bit
> > + sleep 5;
> > + rm -fr "$TRASH_DIRECTORY"
> > + } ||
> > error "Tests passed but test cleanup failed; aborting"
> > fi
> > test_at_end_hook_
>
> I saw this sleep while reading some test-lib.sh code, doesn't this break
> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> 2017-04-20) for non-Windows platforms?
It does not really break it, it just delays the inevitable failure.
> Your CL for v3 suggests this was only encountered in Azure VMs:
> https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/
If by "CL" you refer to the cover letter, then I might have made it sound
as if it was only encountered in the Azure Pipelines agents. I vaguely
seem to remember seeing something like this quite often on my personal
machine, too, though. Most likely Microsoft Defender going a little wild.
> Aside from this obscure issue, wouldn't it make more sense to have some
> optional "I'm under CI" flag to skip the teardown one test at a time as
> we're probably about to shut off the transitory VM soon?
No, I'm not under CI, and I did encounter these issues. And they abruptly
stopped with the patch you apparently still want to discuss ;-)
> I skip some tests, but the test suite creates ~950MB of trash for
> me. Maybe cheaper for some to just keep that around and have it all
> removed at the end.
I don't understand this statement. Or was there a question in it that
you'd like me to answer?
Ciao,
Johannes
--8323328-464419750-1616163626=:57--
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Fri, Mar 19 2021, Johannes Schindelin wrote:
> Hi Ævar,
>
> thanks for digging out this old thread, I really could have done never
> thinking about it again!
I didn't think I'd dug up something so traumatic :)
> On Wed, 10 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>
>>
>> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>>
>> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
>> >
>> > It seems that every once in a while in the Git for Windows SDK, there
>> > are some transient file locking issues preventing the test clean up to
>> > delete the trash directory. Let's be gentle and try again five seconds
>> > later, and only error out if it still fails the second time.
>> >
>> > This change helps Windows, and does not hurt any other platform
>> > (normally, it is highly unlikely that said deletion fails, and if it
>> > does, normally it will fail again even 5 seconds later).
>> >
>> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > ---
>> > t/test-lib.sh | 6 +++++-
>> > 1 file changed, 5 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/t/test-lib.sh b/t/test-lib.sh
>> > index f31a1c8f79..9c0ca5effb 100644
>> > --- a/t/test-lib.sh
>> > +++ b/t/test-lib.sh
>> > @@ -1104,7 +1104,11 @@ test_done () {
>> > error "Tests passed but trash directory already removed before test cleanup; aborting"
>> >
>> > cd "$TRASH_DIRECTORY/.." &&
>> > - rm -fr "$TRASH_DIRECTORY" ||
>> > + rm -fr "$TRASH_DIRECTORY" || {
>> > + # try again in a bit
>> > + sleep 5;
>> > + rm -fr "$TRASH_DIRECTORY"
>> > + } ||
>> > error "Tests passed but test cleanup failed; aborting"
>> > fi
>> > test_at_end_hook_
>>
>> I saw this sleep while reading some test-lib.sh code, doesn't this break
>> df4c0d1a79 (test-lib: abort when can't remove trash directory,
>> 2017-04-20) for non-Windows platforms?
>
> It does not really break it, it just delays the inevitable failure.
>
>> Your CL for v3 suggests this was only encountered in Azure VMs:
>> https://lore.kernel.org/git/pull.31.v3.git.gitgitgadget@gmail.com/
>
> If by "CL" you refer to the cover letter, then I might have made it sound
> as if it was only encountered in the Azure Pipelines agents. I vaguely
Yes, the cover letter.
> seem to remember seeing something like this quite often on my personal
> machine, too, though. Most likely Microsoft Defender going a little wild.
>
>> Aside from this obscure issue, wouldn't it make more sense to have some
>> optional "I'm under CI" flag to skip the teardown one test at a time as
>> we're probably about to shut off the transitory VM soon?
>
> No, I'm not under CI, and I did encounter these issues. And they abruptly
> stopped with the patch you apparently still want to discuss ;-)
>
>> I skip some tests, but the test suite creates ~950MB of trash for
>> me. Maybe cheaper for some to just keep that around and have it all
>> removed at the end.
>
> I don't understand this statement. Or was there a question in it that
> you'd like me to answer?
If the fix was purely for Azure's CI setup I was suggesting that a
better solution might be to keep the trash around until the end. But
that's not the only setup as you note here, so let's discard that
suggestion.
In any case, your patch clearly undoes whatever canary for gc issues
df4c0d1a792 was trying to put into the test-lib, but didn't say so in
its commit message.
So I figured it was something that was missed at the time, and that I
should send a quick E-Mail to both authors to see if anyone cared, maybe
nobody does.
It's just something I ran into while reviewing test-lib.sh for some
unrelated changes I was making...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> In any case, your patch clearly undoes whatever canary for gc issues
> df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> its commit message.
>
> So I figured it was something that was missed at the time, and that I
> should send a quick E-Mail to both authors to see if anyone cared, maybe
> nobody does.
>
> It's just something I ran into while reviewing test-lib.sh for some
> unrelated changes I was making...
Good eyes.
I am not quite sure if we are better off catching a transitory
failure, which could go away when retried in 5 seconds, to remove
the test directory (in which case, the change in question is a clear
regression from what df4c0d1a (test-lib: abort when can't remove
trash directory, 2017-04-20) intended to catch), or we only want to
catch permanent failures (in which case the patch in question is a
clear improvement).
In either way, the decision and the rationale behind it should be in
the log message.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Johannes Schindelin wrote (reply to this):
This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
--8323328-1544568225-1616587311=:50
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Hi =C3=86var,
On Sat, 20 Mar 2021, =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote:
> On Fri, Mar 19 2021, Johannes Schindelin wrote:
>
> >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> >>
> >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> >
> >> > It seems that every once in a while in the Git for Windows SDK, the=
re
> >> > are some transient file locking issues preventing the test clean up=
to
> >> > delete the trash directory. Let's be gentle and try again five seco=
nds
> >> > later, and only error out if it still fails the second time.
> >> >
> >> > This change helps Windows, and does not hurt any other platform
> >> > (normally, it is highly unlikely that said deletion fails, and if i=
t
> >> > does, normally it will fail again even 5 seconds later).
> >> >
> >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > ---
> >> > t/test-lib.sh | 6 +++++-
> >> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> >> > index f31a1c8f79..9c0ca5effb 100644
> >> > --- a/t/test-lib.sh
> >> > +++ b/t/test-lib.sh
> >> > @@ -1104,7 +1104,11 @@ test_done () {
> >> > error "Tests passed but trash directory already removed before =
test cleanup; aborting"
> >> >
> >> > cd "$TRASH_DIRECTORY/.." &&
> >> > - rm -fr "$TRASH_DIRECTORY" ||
> >> > + rm -fr "$TRASH_DIRECTORY" || {
> >> > + # try again in a bit
> >> > + sleep 5;
> >> > + rm -fr "$TRASH_DIRECTORY"
> >> > + } ||
> >> > error "Tests passed but test cleanup failed; aborting"
> >> > fi
> >> > test_at_end_hook_
> >>
> >> I saw this sleep while reading some test-lib.sh code, doesn't this br=
eak
> >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> >> 2017-04-20) for non-Windows platforms?
> >
> > It does not really break it, it just delays the inevitable failure.
I still think this is the best answer to this (implicit) question:
> In any case, your patch clearly undoes whatever canary for gc issues
> df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> its commit message.
I was not _really_ paying attention to that commit when I implemented the
work-around you mentioned above. At the same time I think it does _not_
undo the canary. If the trash directory cannot be removed via `rm -fr`,
and if that is an indicator for something fishy going on, chances are that
the second `rm -fr` a couple seconds later will _also_ fail, and we still
get that error message.
The only reason why the second `rm` should succeed, at least that I can
think of, is that something on Windows blocked those files from being
deleted, and it is no longer blocking after a couple seconds, and that
usually means that an anti-malware scanned those files.
Ciao,
Dscho
--8323328-1544568225-1616587311=:50--
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, SZEDER Gábor wrote (reply to this):
On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
> Hi Ævar,
>
> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>
> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
> >
> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> > >>
> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >> >
> > >> > It seems that every once in a while in the Git for Windows SDK, there
> > >> > are some transient file locking issues preventing the test clean up to
> > >> > delete the trash directory. Let's be gentle and try again five seconds
> > >> > later, and only error out if it still fails the second time.
> > >> >
> > >> > This change helps Windows, and does not hurt any other platform
> > >> > (normally, it is highly unlikely that said deletion fails, and if it
> > >> > does, normally it will fail again even 5 seconds later).
> > >> >
> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> > >> > ---
> > >> > t/test-lib.sh | 6 +++++-
> > >> > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >> >
> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> > >> > index f31a1c8f79..9c0ca5effb 100644
> > >> > --- a/t/test-lib.sh
> > >> > +++ b/t/test-lib.sh
> > >> > @@ -1104,7 +1104,11 @@ test_done () {
> > >> > error "Tests passed but trash directory already removed before test cleanup; aborting"
> > >> >
> > >> > cd "$TRASH_DIRECTORY/.." &&
> > >> > - rm -fr "$TRASH_DIRECTORY" ||
> > >> > + rm -fr "$TRASH_DIRECTORY" || {
> > >> > + # try again in a bit
> > >> > + sleep 5;
> > >> > + rm -fr "$TRASH_DIRECTORY"
> > >> > + } ||
> > >> > error "Tests passed but test cleanup failed; aborting"
> > >> > fi
> > >> > test_at_end_hook_
> > >>
> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> > >> 2017-04-20) for non-Windows platforms?
> > >
> > > It does not really break it, it just delays the inevitable failure.
>
> I still think this is the best answer to this (implicit) question:
>
> > In any case, your patch clearly undoes whatever canary for gc issues
> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> > its commit message.
>
> I was not _really_ paying attention to that commit when I implemented the
> work-around you mentioned above. At the same time I think it does _not_
> undo the canary. If the trash directory cannot be removed via `rm -fr`,
> and if that is an indicator for something fishy going on, chances are that
> the second `rm -fr` a couple seconds later will _also_ fail, and we still
> get that error message.
>
> The only reason why the second `rm` should succeed, at least that I can
> think of, is that something on Windows blocked those files from being
> deleted, and it is no longer blocking after a couple seconds, and that
> usually means that an anti-malware scanned those files.
Both commits referenced in df4c0d1a79's log message fixed races
between 'test_done's cleanup and a still running background 'git gc',
and df4c0d1a79 was meant to draw our attention to similar issues in
the future. And it did:
https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
So no, the failure is not inevitable, there are other reasons why the
second 'rm' might still succeed after the first failed, even just a
fraction of a second later. And yes, that 'sleep' added in the patch
above did unquestionably break that canary,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Wed, Mar 24 2021, SZEDER Gábor wrote:
> On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
>> Hi Ævar,
>>
>> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
>>
>> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
>> >
>> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
>> > >>
>> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > >> >
>> > >> > It seems that every once in a while in the Git for Windows SDK, there
>> > >> > are some transient file locking issues preventing the test clean up to
>> > >> > delete the trash directory. Let's be gentle and try again five seconds
>> > >> > later, and only error out if it still fails the second time.
>> > >> >
>> > >> > This change helps Windows, and does not hurt any other platform
>> > >> > (normally, it is highly unlikely that said deletion fails, and if it
>> > >> > does, normally it will fail again even 5 seconds later).
>> > >> >
>> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
>> > >> > ---
>> > >> > t/test-lib.sh | 6 +++++-
>> > >> > 1 file changed, 5 insertions(+), 1 deletion(-)
>> > >> >
>> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
>> > >> > index f31a1c8f79..9c0ca5effb 100644
>> > >> > --- a/t/test-lib.sh
>> > >> > +++ b/t/test-lib.sh
>> > >> > @@ -1104,7 +1104,11 @@ test_done () {
>> > >> > error "Tests passed but trash directory already removed before test cleanup; aborting"
>> > >> >
>> > >> > cd "$TRASH_DIRECTORY/.." &&
>> > >> > - rm -fr "$TRASH_DIRECTORY" ||
>> > >> > + rm -fr "$TRASH_DIRECTORY" || {
>> > >> > + # try again in a bit
>> > >> > + sleep 5;
>> > >> > + rm -fr "$TRASH_DIRECTORY"
>> > >> > + } ||
>> > >> > error "Tests passed but test cleanup failed; aborting"
>> > >> > fi
>> > >> > test_at_end_hook_
>> > >>
>> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
>> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
>> > >> 2017-04-20) for non-Windows platforms?
>> > >
>> > > It does not really break it, it just delays the inevitable failure.
>>
>> I still think this is the best answer to this (implicit) question:
>>
>> > In any case, your patch clearly undoes whatever canary for gc issues
>> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
>> > its commit message.
>>
>> I was not _really_ paying attention to that commit when I implemented the
>> work-around you mentioned above. At the same time I think it does _not_
>> undo the canary. If the trash directory cannot be removed via `rm -fr`,
>> and if that is an indicator for something fishy going on, chances are that
>> the second `rm -fr` a couple seconds later will _also_ fail, and we still
>> get that error message.
>>
>> The only reason why the second `rm` should succeed, at least that I can
>> think of, is that something on Windows blocked those files from being
>> deleted, and it is no longer blocking after a couple seconds, and that
>> usually means that an anti-malware scanned those files.
>
> Both commits referenced in df4c0d1a79's log message fixed races
> between 'test_done's cleanup and a still running background 'git gc',
> and df4c0d1a79 was meant to draw our attention to similar issues in
> the future. And it did:
>
> https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
>
> So no, the failure is not inevitable, there are other reasons why the
> second 'rm' might still succeed after the first failed, even just a
> fraction of a second later. And yes, that 'sleep' added in the patch
> above did unquestionably break that canary,
Having read that thread now I agree, but I also left with a "who cares?"
and "so let's keep the sleep then?".
I.e. is this a problem that any of the software we're maintaining is
going to care about in the wild, it's not like people are expecting gc,
repack, fast-import etc. to behave well in the face of rm -rfing the
directory they're operating on.
So it seems like just an issue that crops up because of how our test
suite manages and removes per-test trash directories. So it seems better
to:
1. Just keep that "sleep a bit" and retry hack
2. Maybe on some/most platforms we can use cgroups or whatever passes
for a reliable "I started a process tree starting at this PID, kill
-9 the whole thing please" before cleanup these days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, SZEDER Gábor wrote (reply to this):
On Wed, Mar 24, 2021 at 11:57:46PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, Mar 24 2021, SZEDER Gábor wrote:
>
> > On Wed, Mar 24, 2021 at 01:01:49PM +0100, Johannes Schindelin wrote:
> >> Hi Ævar,
> >>
> >> On Sat, 20 Mar 2021, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> > On Fri, Mar 19 2021, Johannes Schindelin wrote:
> >> >
> >> > >> On Tue, Jan 29 2019, Johannes Schindelin via GitGitGadget wrote:
> >> > >>
> >> > >> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > >> >
> >> > >> > It seems that every once in a while in the Git for Windows SDK, there
> >> > >> > are some transient file locking issues preventing the test clean up to
> >> > >> > delete the trash directory. Let's be gentle and try again five seconds
> >> > >> > later, and only error out if it still fails the second time.
> >> > >> >
> >> > >> > This change helps Windows, and does not hurt any other platform
> >> > >> > (normally, it is highly unlikely that said deletion fails, and if it
> >> > >> > does, normally it will fail again even 5 seconds later).
> >> > >> >
> >> > >> > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> >> > >> > ---
> >> > >> > t/test-lib.sh | 6 +++++-
> >> > >> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >> > >> >
> >> > >> > diff --git a/t/test-lib.sh b/t/test-lib.sh
> >> > >> > index f31a1c8f79..9c0ca5effb 100644
> >> > >> > --- a/t/test-lib.sh
> >> > >> > +++ b/t/test-lib.sh
> >> > >> > @@ -1104,7 +1104,11 @@ test_done () {
> >> > >> > error "Tests passed but trash directory already removed before test cleanup; aborting"
> >> > >> >
> >> > >> > cd "$TRASH_DIRECTORY/.." &&
> >> > >> > - rm -fr "$TRASH_DIRECTORY" ||
> >> > >> > + rm -fr "$TRASH_DIRECTORY" || {
> >> > >> > + # try again in a bit
> >> > >> > + sleep 5;
> >> > >> > + rm -fr "$TRASH_DIRECTORY"
> >> > >> > + } ||
> >> > >> > error "Tests passed but test cleanup failed; aborting"
> >> > >> > fi
> >> > >> > test_at_end_hook_
> >> > >>
> >> > >> I saw this sleep while reading some test-lib.sh code, doesn't this break
> >> > >> df4c0d1a79 (test-lib: abort when can't remove trash directory,
> >> > >> 2017-04-20) for non-Windows platforms?
> >> > >
> >> > > It does not really break it, it just delays the inevitable failure.
> >>
> >> I still think this is the best answer to this (implicit) question:
> >>
> >> > In any case, your patch clearly undoes whatever canary for gc issues
> >> > df4c0d1a792 was trying to put into the test-lib, but didn't say so in
> >> > its commit message.
> >>
> >> I was not _really_ paying attention to that commit when I implemented the
> >> work-around you mentioned above. At the same time I think it does _not_
> >> undo the canary. If the trash directory cannot be removed via `rm -fr`,
> >> and if that is an indicator for something fishy going on, chances are that
> >> the second `rm -fr` a couple seconds later will _also_ fail, and we still
> >> get that error message.
> >>
> >> The only reason why the second `rm` should succeed, at least that I can
> >> think of, is that something on Windows blocked those files from being
> >> deleted, and it is no longer blocking after a couple seconds, and that
> >> usually means that an anti-malware scanned those files.
> >
> > Both commits referenced in df4c0d1a79's log message fixed races
> > between 'test_done's cleanup and a still running background 'git gc',
> > and df4c0d1a79 was meant to draw our attention to similar issues in
> > the future. And it did:
> >
> > https://public-inbox.org/git/20190602091919.GN951@szeder.dev/
> >
> > So no, the failure is not inevitable, there are other reasons why the
> > second 'rm' might still succeed after the first failed, even just a
> > fraction of a second later. And yes, that 'sleep' added in the patch
> > above did unquestionably break that canary,
>
> Having read that thread now I agree, but I also left with a "who cares?"
> and "so let's keep the sleep then?".
>
> I.e. is this a problem that any of the software we're maintaining is
> going to care about in the wild, it's not like people are expecting gc,
> repack, fast-import etc. to behave well in the face of rm -rfing the
> directory they're operating on.
>
> So it seems like just an issue that crops up because of how our test
> suite manages and removes per-test trash directories.
Not at all. The real problem is that some stray background git
process is *still* actively writing to the test repository when the
test script is already supposed to be finished.
> So it seems better
> to:
>
> 1. Just keep that "sleep a bit" and retry hack
>
> 2. Maybe on some/most platforms we can use cgroups or whatever passes
> for a reliable "I started a process tree starting at this PID, kill
> -9 the whole thing please" before cleanup these days.
What really seems better:
3. Keep that "sleep a bit" hack only on platforms that can't delete
opened files.
User |
User |
For a long time already, we have tested Git's source code continuously via Travis CI, see e.g. https://travis-ci.org/git/git/builds/421738884. It has served us well, and more and more developers actually pay attention and benefit from the testing this gives us.
It is also an invaluable tool for contributors who can validate their code contributions via PRs on GitHub, e.g. to verify that their tests do actually run on macOS (i.e. with the BSD family of Unix tools instead of the GNU one).
The one sad part about this is the Windows support. Travis used to lack it (Travis announced some early support for Windows, hot off the press), and we work around that by using Azure Pipelines (the CI part of Azure DevOps, formerly known as Visual Studio Team Services) indirectly: one phase in Travis would trigger a build, wait for its log, and then paste that log.
As Git's Windows builds (and tests!) take quite a bit of time, Travis often timed out, or somehow the trigger did not work, and for security reasons (the Windows builds are performed in a private pool of containers), the Windows builds are completely disabled for Pull Requests on GitHub.
One might ask why we did not use Azure Pipelines directly. There were a couple of reasons for that:
The main two reasons no longer apply: macOS agents are available now, as are public projects, i.e. it is possible to configure a Azure Pipelines project so that anybody can view the logs. Since I offered v1, Azure Pipelines has been made available via the GitHub Marketplace, free of cost for open source projects.
I had secured such a public project for Git for Windows already, and I also got one for Git. For now, the latter is hooked up with my personal git.git fork on GitHub, but it is my hope that I convince y'all that these Azure Pipelines builds are a good idea, and then hook it up with https://github.com/git/git.
As a special treat, this patch series adds the ability to present the outcome of Git's test suite as JUnit-style
.xml
files. This allows the Azure Pipelines build to present fun diagrams, trends, and makes it a lot easier to drill down to test failures than before. See for example https://dev.azure.com/git/git/_build/results?buildId=113&view=ms.vss-test-web.test-result-details (you can click on the label of the failed test, and then see the detailed output in the right pane).But maybe you're not interested as much in the presentation of test failures as in the speed? Well, I got you covered with that, too. As of v3, the test suite is run in parallel on Windows, cutting the overall run time to ~33 minutes (see one of the latest builds, for example).
This patch series took waaaaaaaay more time than I had originally anticipated, but I think that in particular the advanced display of the test results and the reduction of the overall run time was worth it. Please let me know what you think about this.
Changes since v5:
sparse
warnings.Changes since v4:
BREW_INSTALL_PACKAGES
is needed.CI_TYPE
variable to guard CI type-specific code.MAKEFLAGS
is now actually configured in the CI type-specific section.GIT_TEST_OPTS
from "Add a build definition for Azure DevOps" to "ci/lib.sh: add support for Azure Pipelines".ci/lib.sh
after theset -ex
statement (as suggested by Gábor a loooong time ago already).Changes since v3:
cut -c
that resulted in an empty trace when failed tests were published (except if the very first test case in a test script failed, then it was correct by mistake).TRAVIS_BRANCH
toCI_BRANCH
already in the first commit, as we should not overrideTRAVIS_BRANCH
with something that it is not.master
to avoid merge conflicts with the recently-mergedsg/stress-test
branch (no, Junio, I really trust myself more than you to resolve those merge conflicts).Changes since v2:
t0000-init.sh
.ci/mount-fileshare.sh
.|| exit 1
, i.e. if they would fail, the build would not have failed.test_atexit_handler
function was scratched; It would be the right thing to do, but is actually an independent topic (it was only implemented in v2 to accommodate the "re-run with verbose log on failure" trick)sed
call.rm
will succeed. We now simply do that, because it won't harm the common case (where the firstrm
succeeds already) and because it helps the Windows job succeed pretty reliably.Changes since v1:
eval
.master
that just happens to be tagged (see https://travis-ci.org/git/git/jobs/424276413 for an incorrectly-skipped build).test_atexit
(which we introduced for that purpose) instead of hard-codingkill_p4d
andstop_git_daemon
.&&
chains.cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: SZEDER Gábor szeder.dev@gmail.com