Fix issue where VRT tests fail before all screenshots are captured #7965
Comments
Not seeing what the cause could be. Since this seems to be CI related, it doesn't fail locally. Unassigning myself.
@techanvil pinging you here after many days of testing, tweaking, and testing some more (because you are the issue creator and may have more insight). First off, there are not many documented cases of this happening online. For the cases that do arise, the solutions vary, but the common suggestions and/or reasons include:
Another topic that came up a few times was that of disk availability. To test this theory and whether it was indeed a disk space issue, I updated the relevant configuration.

When I made those changes, I saw an instant improvement in the occurrence of this error. In fact, for many runs and re-runs, I simply was not able to observe the error again. Unfortunately, it did return every now and then, but few and far between (I ran many, many test runs).

What I did observe, however, is that when I ran the VRT workflow before around 9am my time, which I'd argue is when most of the US and EU tech industries are still relatively inactive compared to business hours, I had a 100% success rate. This could simply be a coincidence, but the later in the day that I (re)ran the workflow, the more likely I was to encounter the error again, although arguably still less frequently than before. Using the same Linux config in local Docker-based testing, I was never able to reproduce the issue, but again, this too could be a coincidence.

This leads me to conclude, admittedly without much supporting information, that this is likely a GitHub resource availability or usage issue, as the workers are shared. My suggestion would be to apply the following changes and see if we observe a general decrease in the occurrence:
Beyond that, I would not discount the possibility that this is only happening on the Alpine Linux image. We could try another distro to test this theory. Thoughts?
Thanks @10upsimon, this sounds like a tricky one to diagnose, so nice work on the R&D so far.

TL;DR: Let's try what you've suggested for now, see how we get on, and refine things/diagnose further in a separate issue.

As mentioned, nice work so far. The fact you've seen an apparent improvement makes me think we should take the pragmatic option of applying these changes in the hope we can immediately improve our CI reliability and thus save time in the development lifecycle.

That said, digging into the specific changes, it's a bit of an assortment of seemingly unconnected tweaks. Did you do enough testing of each in isolation to be sure they are all having an effect? At present it's not clear to me whether each of them is strictly necessary, and it would be nice to be more certain that we are only applying changes with a quantifiable effect. For example, I'm not sure the GH Action VM actually has a GPU for `--disable-gpu` to make a difference.

The results are also a bit unquantified in terms of how much of a difference they make. It's good you've seen an apparent positive change, but we could standardise our testing methods for some clearer results. What I'd suggest, as mentioned, is to apply these changes for now, but introduce a followup issue where we do something along these lines:
With the above testing framework in place, we'd be at liberty to test ideas that occur to us in a relatively time-efficient manner. The scripts that come out of it could be useful for future GH Action debugging too. How does that sound to you?
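For illustration, one way such a standardised testing loop could look is a small script that repeatedly dispatches the VRT workflow and then tallies the outcomes. This is only a sketch: the script path, the run count, and the `visual-regression.yml` workflow name are assumptions, and it presumes an authenticated GitHub CLI and a `workflow_dispatch` trigger on the workflow.

```js
// scripts/vrt-stress-test.js (hypothetical helper; assumes an authenticated `gh` CLI
// and that the VRT workflow file is named visual-regression.yml).
const { execSync } = require( 'child_process' );

const RUNS = 10; // Arbitrary sample size for a rough failure rate.

for ( let i = 0; i < RUNS; i++ ) {
	// Dispatch the workflow (add --ref to target a specific branch).
	execSync( 'gh workflow run visual-regression.yml', { stdio: 'inherit' } );
}

// Once the runs have finished, a summary of conclusions gives a rough pass/fail rate:
//   gh run list --workflow=visual-regression.yml --limit=10 --json conclusion
```

Repeating this at different times of day would also give some signal on the "shared runner load" theory above.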
Removing the next up label for now; after discussion with the team, it seems more thought and investigation is needed on this one.
Thanks both. FWIW, the relevant line is here: site-kit-wp/tests/backstop/config.js, line 30 (at 97816cc).
IB LGTM, but we do need an estimate here, @techanvil. Looking at the changes, it should be very quick to do. The suggested testing of the workflow sounds good too, but this should be tracked in an internal task.
Agreed – all the other GH workflows can be disabled to avoid unnecessary GHA activity.
We can do this with an added workflow. The new workflow should be able to collect information about the run and write it to a file that could be committed back to the repo, since the repo is for debugging anyway. Analysis can be done later by copying the data over into a spreadsheet, etc.
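As a rough illustration of the kind of per-run data collection described above: the script name, the recorded fields, and the output file below are all assumptions; the real workflow would decide exactly what to capture and how to commit it.

```js
// scripts/collect-vrt-run-info.js (hypothetical; invoked from the debug workflow
// before/after the VRT step on the fork used for investigation).
const { execSync } = require( 'child_process' );
const fs = require( 'fs' );

const record = {
	timestamp: new Date().toISOString(),
	runId: process.env.GITHUB_RUN_ID || 'local',
	// Disk availability and shared-runner load were both suspected factors above.
	diskFree: execSync( 'df -h /', { encoding: 'utf8' } ).trim(),
	loadAverage: fs.readFileSync( '/proc/loadavg', 'utf8' ).trim(),
};

// Append one JSON line per run; the workflow can then commit this file back to the
// repo for later analysis in a spreadsheet.
fs.appendFileSync( 'vrt-run-log.ndjson', JSON.stringify( record ) + '\n' );
```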
Thanks @aaemnnosttv, sorry I forgot to estimate 🤦 This one shouldn't take long; I've estimated it as a 7 to accommodate our estimation bump "experiment".
I had already created #8998 as a followup issue for this; would you prefer it to be migrated to an Asana task?
Yeah, I think so, particularly if we're going to get into testing on a fork and all. We could keep that issue as a placeholder for changes that will come out of the investigation, though, since it doesn't have much going on there yet. IB ✅
Thanks @aaemnnosttv, SGTM. I've created an Asana task for the investigative side and updated #8998 to be a placeholder for applying the findings.
Moving to approval since there is nothing to QA.
VRT now seems to run consistently, although it is consistently failing one scenario on mobile. This should be fixed in #9101.
Bug Description
VRT runs in CI are intermittently failing before all the screenshots are captured, with the following error:
Steps to reproduce
Screenshots
This error has been seen with a couple of permutations; note there is an additional log message in the second screenshot.
Do not alter or remove anything below. The following sections will be managed by moderators only.
Acceptance criteria
Implementation Brief
- Update the `Dockerfile` to include manual installation of the `openrc` and `dbus` APK packages.
- Add the `--no-sandbox`, `--disable-gpu`, `--disable-setuid-sandbox`, and `--no-zygote` Chromium launch flags (see the sketch below).
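As a sketch of how the suggested flags could be wired up, assuming the VRT suite uses BackstopJS's Puppeteer engine; the excerpt below is illustrative and not the actual contents of `tests/backstop/config.js`.

```js
// Illustrative excerpt only; not the actual tests/backstop/config.js.
module.exports = {
	// ...existing scenarios, viewports, paths, etc.
	engine: 'puppeteer',
	engineOptions: {
		// Chromium launch flags suggested above to reduce sandbox/GPU-related
		// crashes on the Alpine-based CI image; their individual impact is unverified.
		args: [
			'--no-sandbox',
			'--disable-gpu',
			'--disable-setuid-sandbox',
			'--no-zygote',
		],
	},
};
```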
Test Coverage
QA Brief
Changelog entry