Post a summary of the flaky tests to the commit #45798
Conversation
Size Change: 0 B. Total Size: 1.32 MB.
I like the idea. Can you provide instructions for testing with the forked repo?
Sure. It's gonna be very tedious though, so bear with me if I miss any detail 😅.
const { test, expect } = require( '@wordpress/e2e-test-utils-playwright' );

test.describe( 'Flaky test', () => {
	test( 'should be flaky', async ( {}, testInfo ) => {
		// Fails on the first attempts and only passes once the test has been
		// retried more than once, so the runner reports it as flaky.
		expect( testInfo.retry ).toBeGreaterThan( 1 );
	} );
} );
This worked as advertised for me on a repo fork. I wonder if it would be better to add a comment to the PR rather than the commit when it is run against a PR, as the comment against the commit can be easily overlooked. E.g. below, the size change comment is much more obvious than the flakey test comment bubble against the commit:
True! I've thought about something similar. Since these comments are actually related to the commits rather than the PR, we would need to update the comment to include the commit hash or link to clarify that. I wonder if we should still keep the comment for older commits too; in that case we might want to keep posting to the commit. I guess posting to the commit is a good enough solution for now, and we can iterate if we find it helpful.
Yeh, adding a comment to the PR could be a follow-up.
@youknowriad Curious about your thoughts 🙈.
Sounds good to me. 👍 Makes me think we're starting to have more and more summaries: flaky tests, bundle size (potentially CodeSandbox later). A great UX would be a single "PR summary" comment with multiple collapsible details :)
Yeah, agree! FWIW, this PR doesn't post comments to PRs, yet. Would be nice if we could build a preview website for more advanced content too. For instance, I've been wanting to deploy the Playwright test report HTML somewhere for easier debuggability.
As a way forward maybe we should:
Agreed! Would love to get some reviews! I'll rebase it to resolve the conflicts later.
Force-pushed from 88c5940 to 5779016.
@youknowriad Seems like we have to remove/change the required checks if this is merged. 🙇
Pretty happy with this code-wise and the idea seems good. I left a few questions.
I couldn't quite get it working on my fork, I'm not sure if I did something wrong - talldan@32d5f92
Thanks for testing it out! It appears that the problem is that the "Report to GitHub" step only runs in the context of
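For illustration, a minimal sketch of the kind of context guard being described, assuming a custom Node-based action script (the hard-coded repository check and the messages are assumptions, not the actual workflow's code):

// Hypothetical sketch of a context guard in a custom action script.
// The repository slug check below is an assumption for illustration only.
const core = require( '@actions/core' );
const { context } = require( '@actions/github' );

const repoSlug = `${ context.repo.owner }/${ context.repo.repo }`;

if ( repoSlug !== 'WordPress/gutenberg' ) {
	// On a fork, the reporting step would bail out here.
	core.info( `Skipping flaky tests report for ${ repoSlug }.` );
	process.exit( 0 );
}

// ...otherwise, continue with the reporting steps.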
Force-pushed from 7311857 to f9cb3fd.
This works nicely. An example below (there were more flaky tests than I expected, looks like I didn't need to make a fake one 😄 ):
talldan@fd0e1a0#commitcomment-93272254
We'll have to figure out what to do about the required checks.
Yeah, I noticed the same thing too 😆. That's why we need this for more visible feedback 😅.
I think here is the doc for that, under step 7.
Nice work, @kevin940726! By the way, I'm trying to fix the font size picker related flaky tests in #46591.
This has already introduced an amplification of noise in my GitHub notifications. Given how flakey our tests are it seems like this risks making the flakiness worse, because now not only do those tests not pass, but they also leave comments we have to ignore or delete, or deal with every time we commit or rebase a branch. That is, not only do we have to re-run the tests but we also have to look past these comments, and spend the time navigating to the PR to realize that they were only a distraction. Would love it if instead of adding more noise we could either update an existing comment or simply mark the flakiness somewhere where it only interferes with the people working on the test-flakiness problem. Even with one comment per PR we're still talking about adding unnecessary comments and notifications on a large percentage of the contributions. I do understand the desire to tell people who aren't familiar with how unreliable our tests are that they shouldn't be bothered by the fact that the tests failed, but if we end up doing that by leaving twenty comments on their work we might just make the problem worse 😉
I received yet another notification for these comments while I was writing the above reply. It's worth observing that we don't know from the GitHub notification screen what is demanding our attention until we click through and navigate to the PR and (in some cases scroll down the page and) find the comment and realize it was reiterating that our test suite has failed us. That's frustrating.
I've been thinking about this a bit. The main reason we have flaky-test retries is to prevent the job from failing so that we don't have to care about these tests. If the goal is to actually make people care about these tests and do something about them, I think we should probably just go back to what we had before:
Feels to me like we tried that for years, and it didn't work. There were still lots of flaky tests. I think there were more people interested in fixing them, but only because it was such a bad situation and a terrible developer experience.
Not entirely true. The goal is to make these failures more visible, as opposed to hiding them in each issue, while also unblocking the contributors. So flaky tests won't block the PR but will just post a comment to the commit. This, however, is getting annoying because we currently have too many flaky tests. I don't expect it to post too often once our tests are more stable. I totally agree that the notification is a bit too much though. It'd be nice if we could prevent the comment from sending notifications to the commit authors, but that doesn't seem possible on GitHub. Would only posting/updating a single comment on PRs, as suggested above, sound like a good enough middle ground? In the meantime, we can continue trying to fix as many highly frequent flaky tests as possible.
Sure, but so far, it doesn't feel like the new situation is any better. At least, that's my impression.
What's the benefit of the comment? Maybe a clean revert would be most prudent to avoid leaving in parts that don't get updated. If anything we could add a comment at the top of the project README telling people not to worry about our poor tests. We already know which ones are flakey (at least hypothetically) so why not make those tests optional for a merge? That is, if the test suite fails, show the results, but don't block the PR? The script I wrote for collecting runtime measurements of our Workflows could be easily modified to rerun failed test suites. If we export the list of individual test cases that fail too, it could watch for new flakiness. We could even theoretically call out to that server and ask, "given these test results, are there failures that aren't known to be flakey right now?" and use the response to determine whether to clear or reject the PR. Any option seems better than repeatedly reminding every one of us, day after day, that our tests are unreliable 🙃
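A very rough sketch of what that "ask the server" check could look like (the endpoint, payload, and response shape are entirely hypothetical; no such service exists in the repository today):

// Hypothetical sketch only; the endpoint and response shape are invented.
async function hasNonFlakyFailures( failedTestNames ) {
	const response = await fetch( 'https://flaky-watcher.example.com/check', {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify( { failed: failedTestNames } ),
	} );
	const { knownFlaky } = await response.json();

	// Reject the PR only if at least one failure is not a known flaky test.
	return failedTestNames.some( ( name ) => ! knownFlaky.includes( name ) );
}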
The motivation is mentioned in the "Why?" section in the PR description. The goal is to make flaky tests more visible during commits/reviews so that we get a chance to fix them. I agree though that the pings seem a bit too much at this stage. We can definitely revert this but I still think we need another way to solve this problem or we will just go back to introducing more flaky tests to the project.
The problem actually lies in the hypothetical part. We don't know if a test is flaky until we successfully retry the failed test. A less flaky one could pass a hundred times but fail the 101st time. We don't know if any given PR solves the flakiness without monitoring it for some time either. As for within a single run, we are already retrying failed tests, and if they pass, we don't block the PR.
This works but rerunning the whole workflow is time-consuming and costly. That's why I went for rerunning only the failed test cases.
I don't think I understand this part. Do we commit the list to the repo? Or store them elsewhere? One thing to consider is that PRs can be branched from any given point of time and a static list stored centrally could be easily out of sync.
FWIW, this system won't post comments if there are no flaky tests in the run. I don't expect this to be a daily reminder if we can improve the stability of our tests. (But still, the notification is kind of annoying 😅)
In my experience, which could be wrong based on my perception, we're not talking about 1 in 101 test runs that fails. It's more like 1 in 2 test runs, or 3 in 5.
Here I think we're talking about the same thing, but re-running the tests isn't really that time-consuming or costly from a scripting point of view. My script as-written monitors tests and every five minutes tries to rerun them. This is how I've been able to get hundreds of runs for a PR without even trying. We could modify my "without optimizations branch" approach and create what is essentially an empty branch that tracks
We have Workflows and test suites and individual test cases within those suites. We could potentially write out into an artifact the results of each test case (probably with some JSON export; see the sketch after this comment). This has actually been on my plans for the performance tests CI job, except it took a lower priority once it was no longer the slowest CI workflow.
In practice I'm worried this will always be the case. I've had maybe ten comments today through rebases while working on trying to understand why all performance tests suddenly stopped working. It's not the biggest problem, but it's the kind of frustration that feels incredibly annoying. It's possibly more prominent for me because I'm trying to fix the root problems behind these issues.
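To make the artifact idea a bit more concrete, Playwright can already emit per-test-case results as JSON through its built-in reporter; a minimal config sketch, with the output path being an assumption:

// playwright.config.js (sketch only; the output path is illustrative)
const { defineConfig } = require( '@playwright/test' );

module.exports = defineConfig( {
	// Write a machine-readable report that a CI step could upload as an
	// artifact and an external watcher could diff against known flaky tests.
	reporter: [ [ 'json', { outputFile: 'artifacts/test-results.json' } ] ],
} );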
@dmsnell I'm failing to understand some details here, and I think maybe we're both missing some context. Let me explain how the flaky tests system currently works so that we can be sure we're on the same page. 🙇
All of the above already worked before this PR. The motivation for retrying failed tests (see the sketch after this comment) is to help unblock the contributors so that they don't have to examine those flaky ones individually and manually rerun the workflow. However, this comes with the cost of hiding the flaky tests in each issue, rendering them less visible, and hence making our project flakier as PR authors won't catch that. What this PR does is add a fifth step to aggregate those flaky tests into a comment. We try to make it clear in the comment that the flaky tests probably aren't related to the PR itself, but the information is still there for examination. The problem now is that the notification is too noisy. Possibly only creating a single comment on the PR (but not the commit) is a better alternative. Hope that this explains the situation a little bit better. Any other solution is always welcome though. For instance, I've been wanting to build an external dashboard to host all this flaky test data elsewhere so that we can do advanced data visualization if possible.
I believe it's still a cost to our CI runner, as we don't have unlimited parallel run durations on GitHub.
Might be, but I still believe we can improve our tests' stability so much more. For instance, the recent Playwright migration tends to have more stable tests than the Puppeteer one (just a feeling, there's no proof yet 😅). Could you explain further, in steps, what you have in mind about the "scripting" approach and how it would work?
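As a concrete reference for the retry mechanism described above, a test counts as flaky when it fails and then passes within the configured retries; a minimal sketch of that setting (the count here is illustrative, not necessarily what Gutenberg uses):

// playwright.config.js (sketch; the retry count is illustrative)
const { defineConfig } = require( '@playwright/test' );

module.exports = defineConfig( {
	// A test that fails and then passes within these retries is reported as
	// "flaky" rather than "failed", which is what the reporter aggregates.
	retries: 2,
} );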
Probably! Just to affirm though, thanks again for working on the testing suites. It's a terrible frustration with the project and I'm glad people are trying to make it better.
For this PR maybe this is my biggest gripe. It feels like we're punishing the wrong people, and making noise where it isn't needed. E.g. if the tests are failing in my PR I'm already going to look at those tests and see if they seem related to my changes or not. As anyone who works in the repo should learn rather quickly, if a test fails, it's most likely caused by broken tests and not by broken code. A simple note in the README or on first contributors' PRs might be a way to better communicate this: it's not you, it's our infrastructure.
This seems like a better direction IMO for this problem than nagging people trying to contribute to the project. This is a framework-level problem, or an education problem, or both. I'd rather have a central place to review and work on fixing the broken tests (though I thought that was the point of the issues already, and if that's correct, then it's another reminder that our process is the problem since we haven't prioritized fixing the tests we know are unreliable).
I have a growing suspicion that what we saw early on was more due to the fact that there were fewer tests written in Playwright than in Puppeteer, and also that we're doing a lot more fixturing and mocking. This is a tradeoff: we write tests that won't fail as often, but in exchange give up testing the behaviors they are there to assert. As we continue to fill out those Playwright tests, let's see how the reliability holds. As we see more and more tests succeed when the app fails, because of that fixturing, we might end up unveiling that faux reliability.
Oh, it's probably not any different than what's already there. The idea was more like what you were talking about with a dashboard, and one that would reach out to re-run the failing flakey tests more often. I don't know if 2 or 3 times is enough to get past the flakiness typically, and I don't know - does our existing watcher track things around the repo or just on individual PRs? I am of the understanding that our flakey test watcher is not monitoring the performance tests, but when something happens as it did yesterday and suddenly all Performance Tests workflow runs fail, it would be nice to be alerted about that.
Concluding, and bringing this back to the first point, I don't follow this argument. We are not likely to catch the introduction of flakey tests in the PR that introduces them. This is because when they are introduced we can't know yet if they are flakey; it's likely that during development of a branch some tests will fail and then be resolved due to actual code problems and not due to infrastructure problems. However, it's only after merging those tests into the mainline that we will learn they are flakey, and at that point it's too late to catch. They aren't hidden in PRs by letting those PRs pass; they just aren't obstructing developers for the problems someone else (or their past self) introduced. If we have known flakey tests in tracking issues already, why do we think that telling random PR authors is going to help? Maybe it cools their anxiety, but only until they learn they should assume failed tests are just that - failed tests.
Thanks! I appreciate your feedback a lot too!
It's not what I experienced though. People are often confused by the broken tests and have no clue whether they're related to their PR or not. I've been asked a few times already. Seeing the big red cross on one's PR is somewhat discouraging, whether or not it's conscious. It also prevents the PR from being merged, so a maintainer has to jump in, double-check that it's unrelated to the PR, and re-run the job until it passes.
The comment is the note, IMO. Instead of claiming upfront that our tests are not stable, and risking people not trusting our tests or even avoiding contributing to them, we only leave a note when flaky tests are caught.
We do have the issues list as a central place to view all the flaky tests. They were reported silently, so we had to routinely go to the list to look for something to fix. This PR makes them more visible during PRs, so hopefully we can fix them faster. The Playwright migration is also targeting the flakiest ones to migrate first, so we are prioritizing it; it's just that there aren't many people working on it right now 😅.
We already have 355 test cases in Playwright, compared to 449 test cases in Puppeteer. From a quick (unverified) search, out of 216 total reported flaky tests, 174 are currently written with Puppeteer. Among the 39 open ones, 29 are in Puppeteer as well. This doesn't prove much though, as I didn't take flakiness into account. We don't do fixturing or mocking on stuff that is the focus of the test. We usually only do them for clearing/resetting state, which IMO is the best practice. For instance, we don't manually go to the posts page to delete posts between each test when we can just test that flow once and call the API everywhere else to do the same thing in a much faster and more reliable way. We had a discussion about this a while ago and we might bring it up again in a more formal manner via an RFC PR.
If it fails too often (more than 2 times in a row), then it's probably a sign that the test is too flaky to be considered valuable, and we should fix it or skip it ASAP. We track it whenever the e2e test job is run, that is, on any commit to trunk/release/wp branches and on PRs.
I'm not sure if it makes sense to monitor the performance test though. If something breaks the performance test then we should just try fixing it instead of letting it pass with warnings. That's what the check is for in the first place, isn't it?
I agree, this is a valid concern. However, without a dashboard, I think this is the best option that I can think of for now 😞. I should clarify though that the comment isn't actually for PR authors, but more for maintainers or experienced contributors who also happen to be PR authors. They are encouraged to review the flaky tests and keep monitoring them if needed. It should be a goal for us to keep improving the stability of our test cases and the comment is just a tool to help make it more visible. I opened #46785 to make it only post the comment on the PR but not on commits. LMK if that's better and we can merge and iterate from there.
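For reference, a hedged sketch of what "a single comment on the PR, updated in place" could look like via Octokit in an Actions context (the marker string and names are assumptions, not the code in #46785):

// Hypothetical sketch: create or update a single flaky tests summary comment
// on the PR. The marker string and names are assumptions for illustration.
const { getOctokit, context } = require( '@actions/github' );

async function upsertSummaryComment( token, prNumber, body ) {
	const octokit = getOctokit( token );
	const marker = '<!-- flaky-tests-report -->';

	// Look for a previous summary comment left by this workflow.
	const { data: comments } = await octokit.rest.issues.listComments( {
		...context.repo,
		issue_number: prNumber,
	} );
	const existing = comments.find( ( comment ) =>
		comment.body?.includes( marker )
	);

	if ( existing ) {
		// Update in place so the PR only ever carries one summary comment.
		await octokit.rest.issues.updateComment( {
			...context.repo,
			comment_id: existing.id,
			body: `${ marker }\n${ body }`,
		} );
	} else {
		await octokit.rest.issues.createComment( {
			...context.repo,
			issue_number: prNumber,
			body: `${ marker }\n${ body }`,
		} );
	}
}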
This makes sense, but I think the problem is fairly widespread and unrelated to the changes in any given PR. I don't have numbers on how many PRs experience flakey tests, but I'm curious if it's normal to have a PR that doesn't. If every PR that I propose gets this alarm then I don't understand how having the comment builds any more trust than a notice up-front. I'm reminded that many open-source projects ship with partially-failing test suites. The reality is that our tests don't warrant trust.
This is where I keep asking who the target audience is. On one hand we're talking about alerting feature developers that they aren't the reason the tests failed, but on the other we're talking about making flakey tests more visible to the infrastructure folks. How are we going to test if this leads to faster response? How does this make these tests more visible by scattering them across individual PRs given that they are already collected in one place? If we haven't seen people go to the issues list and pick up flakey tests, what leads us to believe they will scan through random PRs and look for a comment that may or may not be there, which will potentially show them tests they already came by?
That's the same kind of wishful thinking that I think is the reason that education/lists/comments won't help with our flakey tests. We are fully aware that we have flakey tests but we haven't prioritized fixing them. If perf tests have issues they will have to go through the same prioritization. Right now we have flakey tests in the perf test suite, we just don't monitor them. "Just fixing it" hasn't been working.
Not to my understanding. It's true that if the perf tests fail then the PR is rejected. I have suggested we lift this, because the point of those tests is to monitor the performance of the editor. The other E2E suites are there for asserting correct behaviors. If a perf test fails because of a flakey test, it doesn't mean anything other than that the test suites are broken. I suspect it's more likely that if an E2E test fails it's because of a real failure (compared to the perf tests), since those tests are intended to track potentially risky flows, whereas the perf tests kind of assume things work and never expect to fail because of a real flaw. At a minimum, what are your plans for measuring the impact of these changes? How have you proposed we know if this alert is achieving its goal(s)?
The goal is to minimize the number of flaky tests in the project. After the refactoring, Playwright tests tend to have fewer flaky tests, oftentimes zero, which is proof that it is possible. Tests have less value if they are flaky; we don't know whether a test is flaky because it's poorly written or because the functionality is actually broken sometimes. Note that keeping the tests stable is also an encouragement for contributors to write more tests, which eventually makes our project more stable. If we don't trust our tests, then nobody will, and we'd be better off not writing any tests at all.
This is unfortunately true, but I don't think we have better options. The comment will notify the reviewers, who are often maintainers of the project and should be actively monitoring the overall health of our tests.
I have been actively working on fixing many flaky tests, along with many other contributors. Fixing flaky tests is always a priority, just not that high compared to other important features. This is a maintenance job that we just have to keep doing. Perhaps making the report comment more visible will draw more people to help on this too.
I'm not familiar with the perf test, but I'm sure there are folks actively working on maintaining them, aren't there?
This system so far has been helpful for me and other folks to prioritize the flaky tests that we want to fix. I'd say that it's achieving its goal already. Such things might be difficult to measure, but as someone actively working on writing/fixing/migrating/refactoring/reviewing e2e tests, I'd say this is worth the effort. Of course, if anyone feels strongly that it's still too annoying for them, then we can always revert it as you suggested.
What?
This is an attempt to bring more visibility to flaky tests during reviews. It will post a comment on the commit which has flaky tests.
Why?
The original flaky tests reporter started with the idea of retrying failed tests to unblock contributors in PRs with flaky tests. However, a side-effect is that flaky tests tend to be overlooked and ignored by the author. This PR tries to fix that by posting the summary of flaky tests to the commit without blocking the PR authors.
How?
Tons of improvements are in this PR:
The report-flaky-tests job will be run after all e2e jobs are finished.
The report-flaky-tests job is skipped if there isn't any flaky tests report.
Testing Instructions
I'm not aware of an easy way to test this. It's unfortunately a problem with custom GitHub Actions. The best way I can think of is to fork the repo and test it there.
See below for detailed instructions on how to test it in your own fork: #45798 (comment).
Screenshots or screencast
Here's what it might look like in the commit:
Or you can view it here: kevin940726@b87c01c#commitcomment-90061333
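For readers curious what posting a commit comment like the one above involves, a hedged sketch using Octokit inside a GitHub Actions context (not the PR's actual implementation; the function name and parameters are illustrative):

// Hypothetical sketch: post the flaky tests summary as a commit comment.
const { getOctokit, context } = require( '@actions/github' );

async function postFlakyTestsSummary( token, summaryMarkdown ) {
	const octokit = getOctokit( token );

	// context.sha is the commit the workflow ran against.
	await octokit.rest.repos.createCommitComment( {
		...context.repo,
		commit_sha: context.sha,
		body: summaryMarkdown,
	} );
}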