chore: rerun workflow from failed #28143

Merged
merged 33 commits into develop from rerun-workflow-failed
Nov 22, 2024

Conversation

seaona
Contributor

@seaona seaona commented Oct 29, 2024

Description

This PR adds a new workflow to our CircleCI config, called rerun-from-failed, which does the following:

  1. It gets the last 20 CircleCI workflows from the develop branch.
  2. It assesses whether any of the workflows needs to be rerun. The conditions for a rerun are:
    a. The workflow has only been run once (not retried or run multiple times, no matter its final status). This spares credits and avoids re-running the same jobs multiple times. We could change this and allow 2 runs instead of 1, for example, if we see that it's needed.
    b. The workflow is completed and has the status failed.
    c. The workflow runs on the develop branch.
    d. The workflow was triggered by the merge queue bot. This means that we won't rerun scheduled workflows (like the nightly ones). It didn't seem necessary to re-run those, but we can remove the filter if we want.
  3. It reruns from_failed the workflows that meet the conditions above. Note: the CircleCI API does not support the rerun_failed_tests feature.
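
As an illustration of the conditions in step 2, here is a minimal sketch of the selection logic. It is not the script from this PR: the `WorkflowRun` shape, the `runCount` field, and the bot login in `MERGE_QUEUE_BOT` are assumptions made for the example, and the real CircleCI payload has different and additional fields.

```typescript
// Hypothetical, simplified view of a workflow; not the real CircleCI payload.
interface WorkflowRun {
  id: string;
  status: string; // e.g. 'success', 'failed', 'running'
  branch: string;
  triggeredBy: string; // login of the actor that triggered the pipeline
  runCount: number; // how many times this workflow has been run so far
}

const TARGET_BRANCH = 'develop';
// Assumed login of the merge queue bot; adjust to the actual actor name.
const MERGE_QUEUE_BOT = 'github-merge-queue[bot]';
// Condition (a): only workflows that have run once are eligible.
const MAX_RUNS = 1;

// Returns true only when all four conditions (a) to (d) hold.
function needsRerun(workflow: WorkflowRun): boolean {
  return (
    workflow.runCount <= MAX_RUNS &&
    workflow.status === 'failed' &&
    workflow.branch === TARGET_BRANCH &&
    workflow.triggeredBy === MERGE_QUEUE_BOT
  );
}

// Keeps only the workflows that should be rerun from failed.
function selectWorkflowsToRerun(workflows: WorkflowRun[]): WorkflowRun[] {
  return workflows.filter(needsRerun);
}
```

Raising `MAX_RUNS` to 2 would correspond to the "allow 2 runs instead of 1" option mentioned in condition (a).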

This new workflow can be scheduled from the CircleCI UI panel, where we can choose how frequently it runs. Possibly once every hour (Monday to Friday only), but that's fully customizable from the UI.
Our usage falls within the API limits, which are 51 requests per second per endpoint. In our case we will be doing:

  • 1 GET to get all workflows
  • 20 GET to get each workflow status
  • X POST (a max of 20) to rerun the corresponding failed jobs

every time we run the rerun workflow, for a maximum of 41 requests per run.

Implementation

A few words around the implementation of this setup:

  • This setup uses the API token set in process.env.API_V2_TOKEN to authenticate the CircleCI requests
  • This new workflow can be scheduled to run once a day, twice a day, etc., depending on our needs, from the CircleCI UI, using the schedule name rerun-from-failed
  • This new workflow can be enabled and disabled from the CircleCI UI, just by adding or removing the scheduled job
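
To make the authentication point concrete, below is a hedged sketch of how a rerun-from-failed request could be assembled. The function name `buildRerunFromFailedRequest` and the `RerunRequest` type are hypothetical. The sketch assumes the CircleCI API v2 `POST /workflow/{id}/rerun` endpoint with a `from_failed` body field and the `Circle-Token` header; the script in this PR may build its requests differently.

```typescript
const CIRCLE_API_BASE = 'https://circleci.com/api/v2';

// Hypothetical container for everything needed to send the request.
interface RerunRequest {
  url: string;
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
  };
}

// Builds the request without sending it, so it can be inspected or tested.
// The token is passed in explicitly; a caller would typically read it from
// process.env.API_V2_TOKEN.
function buildRerunFromFailedRequest(
  workflowId: string,
  token: string | undefined,
): RerunRequest {
  if (!token) {
    throw new Error('CircleCI API token is not defined');
  }
  return {
    url: `${CIRCLE_API_BASE}/workflow/${workflowId}/rerun`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Circle-Token': token,
      },
      // Rerun only the failed jobs instead of the whole workflow.
      body: JSON.stringify({ from_failed: true }),
    },
  };
}
```

The result can be handed to an HTTP client, for example `fetch(request.url, request.init)`, and the response status checked before treating the rerun as triggered.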

The initial idea of embedding the rerun logic inside the test_and_release workflow, and rerunning right after it finishes, poses some challenges. A decoupled workflow, automated by scheduling, seems to solve them better.
One issue is how to make sure that we are not rerunning from failed forever. That might require additional logic and complexity in the current workflow to track the reruns for that specific workflow (possibly creating more artifacts and reading them).

Another issue is how to ensure that the workflow has finished (no matter if failed or successful) to then apply the rerun if needed:

  • if we used the requires keyword to make the rerun job the last one, that wouldn't serve us, as it would only run if all required jobs were successful (which doesn't solve our task)
  • we could run a job with a timer of ~30 mins, which would make sure that the workflow has finished (whether it failed or not), and then rerun from failed by calling the API. That would consume additional CircleCI resources, though
  • we could add an on_fail trigger for when a job fails, to then trigger the rerun logic, but this would cancel ongoing parallel jobs, which is not desired, as we discussed
  • we could make each job write its result into an artifact, but the challenge again is when to trigger the read of that file

I found that decoupling the rerun and relying on the CircleCI API helps with both challenges. It also avoids polluting the current CI config, since the rerun becomes a totally independent workflow that can be customized from the UI.
It also allows us to use more customizable rules, by accessing the state and number of runs of each workflow in a straightforward manner.

Happy to discuss further though :)


Related issues

Fixes: #25955

Manual testing steps

  1. Check the successful CI run for this new job (in this example, it successfully reran 1 workflow from failed): https://app.circleci.com/pipelines/github/MetaMask/metamask-extension/110689/workflows/9ac7aaee-2610-4985-952d-6bd4f747c071/jobs/4141314
  2. Create a branch off of this branch and remove the filters in the config.yml file, so that the new workflow runs. You can then check the result in CircleCI

Screenshots/Recordings

See pipeline here: https://app.circleci.com/pipelines/github/MetaMask/metamask-extension/110689/workflows/9ac7aaee-2610-4985-952d-6bd4f747c071/jobs/4141314
It fetched the last 20 workflows from develop, got their statuses, and reran only the one workflow that complied with all requirements (not rerun before, and with status failed).

Screenshot from 2024-11-13 19-19-19

Pre-merge author checklist

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots.

Contributor

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

* @returns {Promise<any[]>} A promise that resolves to an array of workflow items.
* @throws Will throw an error if the CircleCI token is not defined or if the HTTP request fails.
*/
async function getCircleCiWorkflowsByBranch(branch: string): Promise<any[]> {


Run

rerun-from-failed:
  when:
    condition:
      equal: ["<< pipeline.schedule.name >>", "rerun-from-failed"]
Contributor Author


This workflow will only run on develop, and only if it's triggered by a schedule with this exact name.

Screenshot from 2024-11-11 16-37-41

@metamaskbot
Collaborator

Builds ready [6e99343]
Page Load Metrics (2148 ± 129 ms)
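| Platform | Page | Metric | Min (ms) | Max (ms) | Average (ms) | StandardDeviation (ms) | MarginOfError (ms) |
|---|---|---|---|---|---|---|---|
| Chrome | Home | firstPaint | 352 | 2574 | 1951 | 568 | 273 |
| Chrome | Home | domContentLoaded | 1766 | 2516 | 2117 | 259 | 124 |
| Chrome | Home | load | 1784 | 2579 | 2148 | 269 | 129 |
| Chrome | Home | domInteractive | 28 | 181 | 63 | 37 | 18 |
| Chrome | Home | backgroundConnect | 8 | 85 | 33 | 25 | 12 |
| Chrome | Home | firstReactRender | 50 | 387 | 146 | 70 | 34 |
| Chrome | Home | getState | 4 | 80 | 23 | 23 | 11 |
| Chrome | Home | initialActions | 0 | 1 | 0 | 0 | 0 |
| Chrome | Home | loadScripts | 1255 | 1901 | 1549 | 198 | 95 |
| Chrome | Home | setupStore | 6 | 60 | 16 | 16 | 8 |
| Chrome | Home | uiStartup | 1982 | 2999 | 2453 | 327 | 157 |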
Bundle size diffs
  • background: 0 Bytes (0.00%)
  • ui: 0 Bytes (0.00%)
  • common: 0 Bytes (0.00%)

@metamaskbot
Collaborator

Builds ready [894483d]
Page Load Metrics (2025 ± 227 ms)
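| Platform | Page | Metric | Min (ms) | Max (ms) | Average (ms) | StandardDeviation (ms) | MarginOfError (ms) |
|---|---|---|---|---|---|---|---|
| Chrome | Home | firstPaint | 379 | 4042 | 1964 | 642 | 308 |
| Chrome | Home | domContentLoaded | 1698 | 3727 | 1990 | 476 | 228 |
| Chrome | Home | load | 1724 | 3735 | 2025 | 472 | 227 |
| Chrome | Home | domInteractive | 19 | 84 | 45 | 17 | 8 |
| Chrome | Home | backgroundConnect | 10 | 161 | 38 | 40 | 19 |
| Chrome | Home | firstReactRender | 55 | 280 | 108 | 48 | 23 |
| Chrome | Home | getState | 5 | 54 | 14 | 15 | 7 |
| Chrome | Home | initialActions | 0 | 1 | 0 | 0 | 0 |
| Chrome | Home | loadScripts | 1209 | 2717 | 1461 | 354 | 170 |
| Chrome | Home | setupStore | 6 | 63 | 25 | 21 | 10 |
| Chrome | Home | uiStartup | 1901 | 4094 | 2285 | 518 | 249 |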
Bundle size diffs
  • background: 0 Bytes (0.00%)
  • ui: 0 Bytes (0.00%)
  • common: 0 Bytes (0.00%)

@seaona seaona marked this pull request as ready for review November 12, 2024 08:28
@metamaskbot
Collaborator

Builds ready [88c0e66]
Page Load Metrics (2067 ± 87 ms)
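| Platform | Page | Metric | Min (ms) | Max (ms) | Average (ms) | StandardDeviation (ms) | MarginOfError (ms) |
|---|---|---|---|---|---|---|---|
| Chrome | Home | firstPaint | 1848 | 2512 | 2066 | 177 | 85 |
| Chrome | Home | domContentLoaded | 1812 | 2465 | 2030 | 167 | 80 |
| Chrome | Home | load | 1846 | 2582 | 2067 | 182 | 87 |
| Chrome | Home | domInteractive | 29 | 264 | 53 | 50 | 24 |
| Chrome | Home | backgroundConnect | 12 | 118 | 42 | 28 | 14 |
| Chrome | Home | firstReactRender | 51 | 291 | 111 | 88 | 42 |
| Chrome | Home | getState | 45 | 205 | 112 | 45 | 22 |
| Chrome | Home | initialActions | 0 | 1 | 0 | 0 | 0 |
| Chrome | Home | loadScripts | 1344 | 1887 | 1517 | 142 | 68 |
| Chrome | Home | setupStore | 6 | 47 | 11 | 9 | 4 |
| Chrome | Home | uiStartup | 2073 | 2978 | 2447 | 286 | 137 |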
Bundle size diffs
  • background: 0 Bytes (0.00%)
  • ui: 0 Bytes (0.00%)
  • common: 0 Bytes (0.00%)

DDDDDanica
DDDDDanica previously approved these changes Nov 19, 2024
@hjetpoluru hjetpoluru self-requested a review November 19, 2024 20:50
hjetpoluru
hjetpoluru previously approved these changes Nov 19, 2024
.circleci/config.yml Outdated Show resolved Hide resolved
* Note: the API returns the first 20 workflows by default.
* If we wanted to get older workflows, we would need to use the 'page-token' we would get in the first response
* and perform a subsequent request with the 'page-token' parameter.
* This seems unnecessary as of today, as the amount of daily PRs merged to develop is not that high.
Member


Good decision. Easier to run this multiple times throughout the day, rather than support paging through more runs.
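
For reference, if paging were ever needed, the 'page-token' approach described in the code comment could look roughly like the sketch below. It is not part of this PR. The `Page` type, the `FetchPage` callback, and `collectItems` are hypothetical names, and the sketch assumes a response shape with `items` and `next_page_token`, where the token is sent back as the 'page-token' parameter on the next request.

```typescript
// Assumed shape of one page of a paginated response.
interface Page<T> {
  items: T[];
  next_page_token: string | null;
}

// Hypothetical callback that performs the HTTP request for one page.
// When pageToken is set, it would be sent as the 'page-token' query parameter.
type FetchPage<T> = (pageToken?: string) => Promise<Page<T>>;

// Follows next_page_token until maxItems are collected or no pages remain.
async function collectItems<T>(
  fetchPage: FetchPage<T>,
  maxItems: number,
): Promise<T[]> {
  const collected: T[] = [];
  let pageToken: string | undefined;
  do {
    const page = await fetchPage(pageToken);
    collected.push(...page.items);
    pageToken = page.next_page_token ?? undefined;
  } while (pageToken !== undefined && collected.length < maxItems);
  // The last page may overshoot, so trim to the requested size.
  return collected.slice(0, maxItems);
}
```

Each extra page costs one more GET per run, which is the trade-off against simply scheduling the workflow more often.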

Member

@Gudahtt Gudahtt left a comment


Great work! Just a couple of minor points of feedback, but overall this looks fantastic. The change request is just for the npx tsx step.

Co-authored-by: Mark Stacey <markjstacey@gmail.com>
@seaona seaona dismissed stale reviews from hjetpoluru and DDDDDanica via e95eec3 November 20, 2024 17:59
@seaona seaona requested a review from a team as a code owner November 20, 2024 18:37
Member

@Gudahtt Gudahtt left a comment


LGTM!

@metamaskbot
Collaborator

Builds ready [e01f7ee]
Page Load Metrics (2092 ± 98 ms)
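| Platform | Page | Metric | Min (ms) | Max (ms) | Average (ms) | StandardDeviation (ms) | MarginOfError (ms) |
|---|---|---|---|---|---|---|---|
| Chrome | Home | firstPaint | 1834 | 2711 | 2096 | 208 | 100 |
| Chrome | Home | domContentLoaded | 1816 | 2691 | 2057 | 198 | 95 |
| Chrome | Home | load | 1832 | 2704 | 2092 | 205 | 98 |
| Chrome | Home | domInteractive | 29 | 67 | 47 | 13 | 6 |
| Chrome | Home | backgroundConnect | 9 | 104 | 39 | 26 | 12 |
| Chrome | Home | firstReactRender | 61 | 119 | 80 | 15 | 7 |
| Chrome | Home | getState | 63 | 114 | 92 | 13 | 6 |
| Chrome | Home | initialActions | 0 | 1 | 0 | 0 | 0 |
| Chrome | Home | loadScripts | 1339 | 2174 | 1541 | 187 | 90 |
| Chrome | Home | setupStore | 6 | 20 | 9 | 3 | 1 |
| Chrome | Home | uiStartup | 2124 | 3034 | 2389 | 235 | 113 |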
Bundle size diffs
  • background: 0 Bytes (0.00%)
  • ui: 0 Bytes (0.00%)
  • common: 0 Bytes (0.00%)

@seaona seaona added this pull request to the merge queue Nov 22, 2024
Merged via the queue into develop with commit b6613df Nov 22, 2024
77 checks passed
@seaona seaona deleted the rerun-workflow-failed branch November 22, 2024 07:05
@github-actions github-actions bot locked and limited conversation to collaborators Nov 22, 2024
@metamaskbot metamaskbot added the release-12.9.0 Issue or pull request that will be included in release 12.9.0 label Nov 22, 2024
Labels
area-qa Relating to QA work (Quality Assurance) release-12.9.0 Issue or pull request that will be included in release 12.9.0 team-extension-platform
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Circle ci: e2e test improvement for retrying failed develop branch
6 participants