-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DX Streamline] Establish flaky test eradication program #12127
Comments
I'm pasting in some comments that had from Slack DMs: Valuable comments from @MarcoPolo who lead this effort for go-libp2p: Jotting a couple thoughts:
If you want to do something similar for lotus, I wouldn't recommend using the FlakyTests julia notebook. It's idiosyncratic and someone will spend more time understanding it than rewriting it. I know you didn't ask but here's what I'd recommend:
|
I don't believe this is the case, after getting tangled up not finding secrets I discovered that they were just moved to a new location. |
Glad to be corrected, thanks! |
That's the great thing about the itest setup in lotus, each test gets their own job so they show up separately in the results by default. Otherwise, we'd have to dig in into the artifacts, parse reports, etc. It is not impossible or even that hard by any means, but if we don't have to do that, that's great! To sum it all up, I guess the only missing piece is a periodic data aggregation so that we don't have to query GitHub API directly for all individual runs any time someone wants to inspect the flakiness. We could run it on schedule and store the results in the artifacts. We get 90 day retention tops here, but that should be enough to see improvement/deterioration over time. And if it's not, we can defer to a more long-term storage on S3 for example. |
Thanks all for the updates. On the observable notebook side: Concerning BuildPulse: |
Oh. I was thinking you could make the notebook public but it wouldn't work for anyone else unless they have a secret with the same name in their account (that stores the GITHUB API key). Even if someone had to fork it, that seems fine. (I know we're still exploring and not committed to the Observable path, but if we do go down it, maybe it makes sense to get a FilOz teamspace?) |
@galargh: thanks for educating me. Do any of these jobs have multiple in them or are they always one to one? I thought I had seen multiple PASSes in some jobs, but I didn't look too deeply across a very large sample. If we do have multiple tests in a job, I wonder if we'll still want the deeper analysis. |
I know we need to decide on some next steps here. We've had the observable notebook and BuildPulse ideas. I also saw a comment from IPDX about extending GitHub Monitoring to cover test flakiness using the “passing rerun heuristic”. I can't give more thought on this at the moment. I welcome IPDX to drive here with a decision-table or pros/cons list. I do think its worthwhile for us to try BuildPulse. My meeting with their CEO was good and I like what I saw. It's self-service for us to get started, and the first week is free while we're exploring. The only requirements are that we do junit test output (I don't know if we are but I assume it's easy enough to do?) and wire in a GitHub Action. @galargh : i think it's fine to spend the time to do these things even if we don't go with BuildPulse so folks can try the tool firsthand and in case it helps guide any alternative approach we take. Thanks! |
For anyone watching, FYI that there is work in progress on this. There is the BuildPulse PoC (being enabled by #12225) and IPDX also created some reporting in Grafana at https://ipdx.grafana.net/dashboard/snapshot/TfD5JfybgB0odLmipdZniNgayqr1zv3c (screenshot below) |
@galargh : concerning the Grafana dashboard...
(We don't have to invest time here on this yet. Once we compare with the BuildPulse PoC, we can see where we want to double-down.) |
2024-09-04 status update: we're still looking at BuildPulse:https://app.buildpulse.io/@filecoin-project/lotus We've been seeing these issues:
@BigLep investigation: this appears to be the case for all itests when looking at the test.yml workflow (example). All of the "itest" jobs output with "command-line-arguments". According to ChatGPT:
I assume our custom setup for running
@BigLep investigation: The data actually matches for me when looking at the last two weeks in August: https://app.buildpulse.io/@filecoin-project/lotus/builds?q=&show_type=builds&filter=failed&starting_at=%222024-08-18T00%3A00%3A00.000Z%22&ending_at=%222024-09-01T00%3A00%3A00.000Z%22 The case where there is a failed test.yml that doesn't show up in buildpulse is when an earlier job in the workflow doesn't complete (e.g., "cache dependencies" - example) I think the next step is figure out the "command-line-arguments" issue. |
Sharing a quick update from go-libp2p. We're saving the outputs of So far it has been a great quality of life improvement because:
fwiw, I think this approach should be pretty easy to replicate. There's very little (if any?) go-libp2p specific code there. See libp2p/go-libp2p#2923 for more details. |
Thanks @MarcoPolo, very interesting workflow! Why sqlite though? Wouldn't it be easier to sling some json around for this? Is it just because of the complexity of your flaky query? |
regarding
I just can't find a way around this (yet). Even specifying an |
Maybe a personal preference. I prefer a SQL query for this as it’s declarative and probably faster than the equivalent JS. But of course, json would work just fine. |
$ go test ./itests/. -run "$(git grep -E '^func Test[^(]+\(\w+ \*testing\.T\)' itests/eth_transactions_test.go | sed -E 's/^.*func (Test[^(]+)\(.*/\1/' | awk '{printf "|^%s", $0}' | sed 's/^|//')"
ok github.com/filecoin-project/lotus/itests 28.747s That's the best I can come up with so far for showing the package name properly and avoiding Two downsides: we'd have to be sure to have unique test function names across the files (and remember this into the future) and this is a complicated bit of command-line hackery just to make it show up nicer in BuildPulse and I don't think that's a good enough justification to increase the complexity of our CI pipeline. |
Modifying the .xml file directly before uploading #12434 |
Done Criteria
Why Important
Flaky tests are insidious. We have them in our codebase, and it has two primary problems:
We want to have the standard of "we only merge if CI is green", and that is much easier bar to hold if flakiness has been eradicated.
User/Customer
Maintainers and contributors.
Notes
The text was updated successfully, but these errors were encountered: