The future of our Functional Test Runner #22947
Comments
+1 for implementing a new FTR and migrating over a small chunk of tests at a time. Iterating in place will require more work in the long run, as each step will require passes over all of the functional tests. It would be easier and less error prone to set up a new FTR and then make a single pass over each test suite as it is migrated to the new FTR.
Here's my concern. I think almost all the flakiness in our UI tests is caused by not knowing when each action a test makes is really complete. There are so many actions a test can do in Kibana, and the page_object methods or services we use to perform them should wait until Kibana is finished responding to that action. I know several people have implemented, or are trying to implement, ways to detect this.
It might help to discuss a concrete example of a flaky test issue in the current FTR and explain what some new test runner would do differently. I just merged a PR to fix some rbac tests where the test would log in as a different user and then navigate to visualize/new.
But the login method returned immediately after clicking Submit on the login page. What happens when you click that Submit, and how do we know when it's done? If we don't wait for that, the test will try to navigate WHILE that submit action is also changing Kibana, and the navigation to visualize/new fails. Sure, we can check if the navigation worked and retry it if it didn't. But that will create tests that are constantly retrying every action 👎 I'd much rather fix the login method to wait for some indication that the action is done so we have reliable actions. But that seems extremely hard to know in Kibana.
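For illustration, here is a minimal sketch of what a "wait until the action is done" login method could look like. It assumes the current FTR's provider pattern and its testSubjects/retry services; the `userMenu` test subject is a hypothetical indicator that the new session is active:

```js
// Hypothetical page object: click Submit, then block until Kibana shows
// evidence that the login actually completed.
export function LoginPageProvider({ getService }) {
  const testSubjects = getService('testSubjects');
  const retry = getService('retry');

  return new (class LoginPage {
    async login(username, password) {
      await testSubjects.setValue('loginUsername', username);
      await testSubjects.setValue('loginPassword', password);
      await testSubjects.click('loginSubmit');

      // Don't return as soon as the click lands; wait for an indicator
      // that the new session is in place before letting the test proceed.
      await retry.try(async () => {
        if (!(await testSubjects.exists('userMenu'))) {
          throw new Error('login still in progress');
        }
      });
    }
  })();
}
```

With something like that in place, a test can navigate immediately after `login()` resolves without racing the submit action.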
There are multiple issues going on imo. One is that we often use actions that return before Kibana has finished responding, which is flaky if network speed is slow.

Another issue I ran across is clicking buttons while they are still loading. Leadfoot executes the click function without complaint, but the click might not go through. This manifests as some failures with the save dialog (and is easy to repro when network speed is slow and saving the dashboard takes a while). Rather than implementing the fix at the point of the save dashboard call, I tried baking it into the testSubjects service, so any other places that make this mistake get fixed as well: https://github.com/elastic/kibana/pull/22901/files#diff-6238cdd3f41d65e4fddb017713976292R190. The problem is this changed the timing, so now I am seeing other failures show up more frequently in that PR.

Another issue is knowing when an embeddable has finished rendering. That one is specific to embeddable rendering, but the tests have their own issues (not related to reporting at all). For instance, click on a row while an EUI table is still loading and you might not click the actual row. Another instance I ran across was with input controls, but that was actually a bug in the code - a different menu flickers on the screen for a sec, but is quickly replaced by another menu.

Again, I think each of these situations needs to be handled individually. Table row clicks need to wait for the table loading indicator to be gone. Interactions with flyouts need to wait for the flyout to finish sliding over (or for animations to be disabled). I have been trying to fix a few of the above in #22901 but am hitting another flaky issue. This is the problem I am encountering - with so much flakiness and so much fragility, it's hard to isolate and fix a single problem, because the whole test suite then crumbles and more flakiness is exposed! If a PR fixes the flakiness of 1 test but, via a side effect, exposes the flakiness of 5 more tests... how is that PR supposed to make progress and get checked in? Do we skip all the other ones? When will we find time to delve further into them? Or, we write a new FTR, get all the extra benefits, and can isolate and focus on each of these issues with flaky tests.
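To picture the "bake it into the service" approach: a guarded click that retries until the target is actually interactable, instead of clicking into a loading state. A sketch only - `isEnabled` exists on both leadfoot and WebDriver elements, but the exact wiring into the testSubjects service would differ:

```js
// Sketch: retry until the element exists AND reports itself enabled,
// then click. A click on a still-loading button throws and is retried
// instead of silently not going through.
async function guardedClick(testSubjects, retry, subject) {
  await retry.try(async () => {
    const element = await testSubjects.find(subject);
    if (!(await element.isEnabled())) {
      throw new Error(`"${subject}" is still disabled/loading`);
    }
    await element.click();
  });
}
```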
@LeeDr In the case of your scenario where you log out as one user and log back in as another, what is the indicator that the new user has logged in?
I don't think it's so much that the new FTR is going to magically fix flakiness, but that it allows us to work on each flaky test issue in isolation and ensure that the tests in it are all much more stable (esp if we formalize an easy way to check for flakiness when adding a test - minus the current method of just running the test over and over manually).
We can't use normal selenium DOM checks for this case because embeddables and visualizations are pluggable, and reporting needs the concrete implementations to tell it when they are done rendering - we have no idea what dom elements they will be loading internally, so there isn't anything we can specifically check for across the board. Well, not unless we told embeddables to add a certain dom element when they are done rendering. We do something similar, but we tell them to alert us via the `renderComplete` event.
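In test terms, that contract could look roughly like the sketch below. The `data-render-complete` attribute follows Kibana's render-complete convention, but treat the details here as simplified assumptions:

```js
// Sketch: wait until every embeddable on the page has marked itself as
// done rendering, without knowing anything about its internal DOM.
async function waitForAllEmbeddablesToRender(find, retry, expectedCount) {
  await retry.try(async () => {
    const done = await find.allByCssSelector('[data-render-complete="true"]');
    if (done.length < expectedCount) {
      throw new Error(
        `only ${done.length}/${expectedCount} embeddables report render complete`
      );
    }
  });
}
```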
@stacey-gammon If there is a reliable way to ensure that the `renderComplete` signal has fired, couldn't the tests wait on that same signal?
I'm also +1 for implementing a new FTR and migrating over a small chunk of tests at a time. I also experienced the same frustrations that @stacey-gammon did, and they're the reason I wasn't able to merge what should have been a reasonable improvement to our tests (#21302). All I wanted to do was wait for a toast to appear and then dismiss it. However, additional tests kept breaking, some of them flaky and some of them not, and all of them difficult to analyze. I don't think a new FTR will magically fix this, but a new FTR based on commonly-used tools and patterns will allow me to have greater understanding of our test codebase, greater confidence in the code, and empower engineers like @silne30 and @liza-mae to support me better (this is an important point!).

I'm reassured by the fact that we already have evidence that this approach works. We've migrated large portions of our UI from Angular to React and simultaneously updated our look-and-feel by committing to a long-term vision, providing tools and patterns the entire team can follow (EUI and Jest test guidelines), and then letting the team convert code over one small chunk at a time. As long as we give ownership over this process to a few individuals, they invest heavily in fleshing out the vision up-front, and then move as quickly as possible to open it up to the rest of the team, I think we'll be successful.
That is the crux of the issue. Depending on the specific visualization implementation, some are solid, some are iffy. There is an issue with the timelion implementation. There is also an issue with most implementations that use D3 (but they are different issues); see #22581 and #22601 for background. But that is really separate from the FTR and is something we need to fix in code. There is no reason to hold up all tests because of this issue; only a subset of tests are affected. Some can use specific knowledge to wait for render (e.g. a pie slice on a pie). Also, I don't think this specific bug has manifested as actual flakiness on CI. We weren't even aware of it being an issue until the reporting implementation changed some timing.
It seems like the most significant issue that this proposal seeks to address is that it is difficult to fix any particular flaky test because so many other tests are flaky. Something like simulated network latency can be hugely useful here, but it's impractical to use because it breaks so many tests. We'd like to grab tests one by one, separate them out from all the other unfixed tests, fix them, and then have them run in a "stable" place, eventually replacing all of our existing tests.

Unless I'm missing something, we don't need an entirely new testing framework and set of tools to address this problem. We can create a new CLI task, selenium job, etc. using the existing tool set and then start moving tests over one by one, fixing them along the way. Then you can try slow browser connections and whatnot without worrying about it impacting a ton of tests that you can't focus on yet. None of the concerns raised about knowing when things are loaded are test framework concerns, so using a different test framework won't solve them.

This doesn't mean we can't build a new test framework, because there are definitely other benefits here, but doing so will take a lot of person time, of which we have very little to spare at the moment. And it seems to be a nice-to-have rather than a critical component of fixing our tests. It would even be easier to build and roll out if we had already fixed the test flakiness issues.
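As a sketch of that idea, a second functional config (hypothetical path) could reuse the existing setup but whitelist only the suites that have been vetted as stable, and run as its own CI job:

```js
// test/stable_functional/config.js (hypothetical): reuses the existing
// functional config, but only runs suites that have been de-flaked.
export default async function ({ readConfigFile }) {
  const baseConfig = await readConfigFile(require.resolve('../functional/config.js'));

  return {
    ...baseConfig.getAll(),
    // Move suites here one at a time as they are isolated and fixed.
    testFiles: [require.resolve('./apps/dashboard')],
  };
}
```

That job could then run with simulated network latency without destabilizing the untouched suites.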
I might have missed this, but how often are we fixing a test because of an actual code issue instead of an issue with the test environment? I'm trying to understand the relative value of integration tests as a whole and how much time we spend fighting the test environment versus the value it provides in terms of catching actual issues in our source code.
Agreed, we don't need a new FTR for this, and if flaky tests were the only reason a new FTR was being proposed, I'd say that's def not enough to justify the effort. It's just one benefit to having a new FTR.
Totally agreed, and if it would take a very long time to see any benefits from a new FTR, I'd also say let's not do it. I'm bringing it up because it seems like in-place improvements are also going to take a long time, so it might actually take a shorter amount of time to get at least some tests on cross browser testing. Even if we created a separate CLI task to isolate stable and flaky tests, we can't replace an entire back end if it creates way more instability. Unless we only swap out the back end for the tests being run in the "stable" portion. But then we'd have to support different tests running on different backends... and I ask again whether it wouldn't just be better to bite the bullet and implement a new FTR as part of this effort.
I've wasted innumerable hours fighting flaky tests, but you can still pry them from my cold, dead hands. They are invaluable given how tough it is to test some of our really wonky infrastructure, like AppState, Courier, and the time picker. There are rootScope.broadcasts still in the code. Can't unit test that! The tests suck, and the flakiness sucks, but time and again they have demonstrated their usefulness in catching actual bugs. We need them given how fast we iterate, especially as we try to refactor the messy back end code. @ppisljar or @timroes -- I'm pretty sure you both have hit some unexpected test failures in your refactoring PRs that exposed legit bugs, is that right? I was trying to turn _dashboard_filtering tests back on... and found a legit bug. The rendering test flakiness? A legit bug in timelion.
@epixa I think flaky tests are only one part of the picture. I think the original intention behind this issue is to figure out the best process for replacing a hand-rolled, legacy system and layers of tech debt with a modern system using conventional tools and consistent patterns. That perspective assumes that our goal is to replace the current FTR; the question is how. If anyone wants to revisit and validate that assumption, @silne30 has put together some materials that justify it (in my opinion), including an analysis of our current system's problems, feedback from other test engineers/architects, and a POC of what a new system could look like.
If we are going to go through the effort of migrating tests one at a time to a new location to fix flakiness, why not use that effort to also migrate the tests to a new FTR? This would avoid having to migrate all of the tests to the new FTR at a later time to enable multi-browser support.
I have this PR to split tests into much smaller jobs, so CI might use 15 Jenkins workers but get the tests done in less than 30 minutes: #22359. I think we're making decent progress on flaky tests. We're getting smarter about understanding the causes and coming up with more robust solutions.
I agree with a lot of what @cjcenizal is saying. For me the two biggest pain points are the flaky tests and not having more standards/guidelines to follow. I am not sure I understand what role the FTR plays in flaky tests, so for me the FTR is not much of an issue right now. I do think we are making progress on flaky tests, but from what I see a lot of the higher priority flaky tests have just been skipped, so more work is needed to fully understand the root cause of these failures.
@spalger created some wrapper services around leadfoot in #26406, which make it easier to replace leadfoot with WebDriver without having to rewrite all the tests (there will still be some changes). @silne30 and @dmlemeshko are now working on that WebDriver implementation in #26477. Is this issue still serving a purpose now?
This particular issue isn't really needed anymore. We have others tracking progress and we have already decided to do an in-place replacement. We can close this.
Background
I've had a few good conversations with various folks regarding the state of our functional test runner. We all agree there are things that could be improved with it, and we all agree that replacing leadfoot is desired. The biggest question is how to go about implementing those changes.
Start over from scratch

This will be a vertical approach. We'll implement a new FTR and migrate over a small chunk of tests at a time. Among its deliverables would be `retries` support to keep flaky tests from killing CI (we'd still need to be able to track and tackle flakiness though!). A sketch of what that could mean follows.
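For context, `retries` could be as small as what mocha (which the FTR already builds on) provides natively - a sketch:

```js
describe('dashboard filtering', function () {
  // Re-run a failed test up to 2 extra times before failing the CI job.
  // Retries should still be reported somewhere so flakiness stays visible.
  this.retries(2);

  it('applies a pie chart filter', async function () {
    // ... test body ...
  });
});
```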
Iterate in place

A horizontal approach - tackling slices at a time. The first attempt was to swap out leadfoot for selenium webdriver, but this is proving to be a large task. We can continue this effort, or we can try to tackle thinner horizontal slices. For example, instead of first trying to swap in selenium, we can first add an abstraction layer that eliminates direct calls on elements and the `remote` object in one PR, as sketched below. The problem I see with this is that it may take longer to see concrete test improvements.
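A sketch of what that abstraction layer might look like (method names illustrative): tests depend on one wrapper service, and only that service touches the underlying driver:

```js
// Hypothetical 'browser' service: the single place that talks to the
// underlying driver. Swapping leadfoot for selenium-webdriver then means
// reimplementing this one service, not touching every test.
export function BrowserProvider({ getService }) {
  const remote = getService('remote'); // leadfoot today, WebDriver later

  return new (class Browser {
    async clickByCssSelector(selector) {
      const element = await remote.findByCssSelector(selector);
      await element.click();
    }

    async getCurrentUrl() {
      return await remote.getCurrentUrl();
    }
  })();
}
```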
Discussion

I'm bringing this up now after spending a very long time on flaky dashboard tests. I was finally able to reproduce many of them by changing the network speed to slow. There is no simple way to do this when running the tests; it requires manual intervention. It looks like there is support for this in wdio: webdriverio/webdriverio#1803. But once I did that, it was tough to even fix a subset of the failures because there were just so many. I have a PR that should improve flakiness, but because it slightly changed the timing on all click events, there are now even more failing tests.
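For what it's worth, once on selenium-webdriver, Chrome can simulate a slow connection from the test itself - a sketch assuming selenium-webdriver's Chrome-only `setNetworkConditions` command:

```js
const { Builder } = require('selenium-webdriver');

async function buildSlowNetworkDriver() {
  const driver = await new Builder().forBrowser('chrome').build();

  // Chrome-only DevTools emulation: add latency and cap throughput for
  // every request the page makes - no manual intervention required.
  await driver.setNetworkConditions({
    offline: false,
    latency: 500, // added round-trip latency in ms
    download_throughput: 50 * 1024, // bytes per second
    upload_throughput: 50 * 1024,
  });

  return driver;
}
```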
If we had a new FTR in place, it'd be faster for me to go through each test one by one, decide whether it should be migrated to the new FTR, prune unnecessary tests, use a consistent model, and also verify the lack of flakiness.
I have been waffling back and forth on the best approach. I generally dislike the risks involved in starting over from scratch, but I think the flakiness of our current implementation is pushing me over the edge. Being able to assert that all tests going into the new FTR are stable would be a huge win and allow us to focus in on stabilizing small chunks of tests at a time. This would also encourage breaking up the tests into smaller chunks. For instance, I put back filtering tests and now a completely unrelated test is failing for a completely unrelated reason (#22787). Did my test changes introduce some tiny timing change that is killing the other test? It's very strange.
If we can come up with the right deliverables for a new FTR, like `retries` support, I think it might be worth it to reconsider the vertical approach. I'd be willing to pair up with @silne30 and come up with a new PR for a new FTR, and help migrate over dashboard tests as a first consumer.
cc @spalger @epixa @LeeDr - what are your thoughts? Timing- and priority-wise, do you think we could get away with revisiting the vertical approach, with a new FTR?