The future of our Functional Test Runner #22947

Closed
stacey-gammon opened this issue Sep 11, 2018 · 19 comments
Labels: discuss, Team:QA, test

@stacey-gammon (Contributor)

Background

I've had a few good conversations with various folks regarding the state of our functional test runner. We all agree there are things that could be improved with it, and we all agree that replacing leadfoot is desired. The biggest question is how to go about implementing those changes.

  1. Iterate in place
    or
  2. Start over from scratch

Start over from scratch

This will be a vertical approach. We'll implement a new FTR and migrate over a small chunk of tests at a time.

| Pros | Cons |
| --- | --- |
| Faster time to get some cross-browser testing (assuming it's faster to implement a new FTR base plus a little migration than the horizontal Selenium back-end swap approach - debatable) | Longer time to see all tests implemented with cross-browser support |
| Opportunity to review every test and pare back during migration to the new FTR | Higher risk of failure if we don't see it through - no incremental benefits along the way |
| Opportunity to review flakiness during migration | |
| New FTR has retries support to keep flaky tests from killing CI (still need to be able to track and tackle flakiness though!) | |
| Once migration is complete we can remove a ton of custom code | |
| Using standard technologies will be easier for new devs to pick up than our custom FTR | |
| Consistency within the new package | Inconsistency at the top level - yet another new way to do things (but we already have so many ways to do things even within the current FTR) |

Iterate in place

A horizontal approach - tackling slices at a time. The first approach was to swap out leadfoot for Selenium WebDriver, but this is proving to be a large task due to:

  • Direct access of remote
  • Direct access of elements, which offer a different API than Selenium (since leadfoot does not comply with the W3C spec)
  • Inherent test flakiness, making it difficult to distinguish failures introduced in a PR from existing failures.

We can continue this effort or we can try to tackle thinner horizontal slices. For example, instead of first trying to swap in selenium, we can first add an abstraction layer to eliminate direct calls on elements and the remote object in one PR. The problem I see with this is that it may take longer to see concrete test improvements.

| Pros | Cons |
| --- | --- |
| Faster time for all engineers writing tests to see some minor benefits (e.g. TypeScript support) | Longer time to see fuller benefits (e.g. "retry" support, or not having the mouse taken over) |
| | Will be very difficult to ensure changes don't cause flakiness given how bad the flakiness situation is today |
| | Timing changes can cause a ton of flakiness; just set network speed to slow and see the onslaught of failures |

Discussion

I'm bringing this up now after spending a very long time on flaky dashboard tests. I was finally able to reproduce many of them by changing the network speed to slow. There is no simple way to do this when running the tests; it requires manual intervention. It looks like there is support for this in wdio: webdriverio/webdriverio#1803. But once I did that, it was tough to fix even a subset of them because there are just so many failures. I have a PR that should improve flakiness, but because it slightly changed the timing on all click events, there are now even more failing tests.

If we had a new FTR in place, it'd be faster for me to go through each test one by one, decide whether it should be migrated to the new FTR, prune unnecessary tests, use a consistent model, and also verify the lack of flakiness.

I have been waffling back and forth on the best approach. I generally dislike the risks involved in starting over from scratch, but I think the flakiness of our current implementation is pushing me over the edge. Being able to assert that all tests going into the new FTR are stable would be a huge win and allow us to focus in on stabilizing small chunks of tests at a time. This would also encourage breaking up the tests into smaller chunks. For instance, I put back filtering tests and now a completely unrelated test is failing for a completely unrelated reason (#22787). Did my test changes introduce some tiny timing change that is killing the other test? It's very strange.

If we can come up with the right deliverables for a new FTR, like:

  • Bake in a way to test for flakiness (an easy way to say, "run this test suite x times")
  • Stop flakiness from blocking developers via mocha's retries (see the sketch after this list)
  • No mouse takeover, so tests can run in the background while doing other stuff
  • Cross browser testing
  • Consistency within a single package. A documented "right way to write tests" approach. All new functional tests go in here.
  • Follow more standard test practices so new test engineers are more familiar with the paradigms.
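
To make the retries deliverable concrete, here is a minimal sketch of how mocha's built-in retries could be wired into a suite; the suite name, test name, and PageObjects helper are hypothetical placeholders, not existing tests.

    // A minimal sketch, assuming mocha stays the runner underneath the new FTR.
    // The suite name, test name, and PageObjects helper are hypothetical.
    describe('dashboard filtering', function () {
      // Re-run a failing test in this suite up to 2 more times before
      // reporting a failure, so a single flaky run doesn't kill CI.
      this.retries(2);

      it('filters the data table when a pie slice is clicked', async () => {
        await PageObjects.dashboard.loadSavedDashboard('dashboard with filters');
        // ... interactions and assertions would go here ...
      });
    });

Retries alone would only hide flakiness from CI, so they would need to be paired with the tracking mentioned above.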

I think it might be worth it to reconsider the vertical approach. I'd be willing to pair up with @silne30 and come up with a new PR for a new FTR, and help migrate over dashboard tests as a first consumer.

cc @spalger @epixa @LeeDr - what are your thoughts? Timing- and priority-wise, do you think we could get away with revisiting the vertical approach, with a new FTR?

@nreese (Contributor) commented Sep 11, 2018

+1 for implementing a new FTR and migrating over a small chunk of tests at a time.

Iterate in place will require more work in the long run, as each step will require passes over all of the functional tests. It would be easier and less error prone to set up a new FTR and then just make a single pass over each test suite as it is migrated to the new FTR.

@LeeDr (Contributor) commented Sep 11, 2018

Here's my concern. I think almost all the flakiness in our UI tests is caused by not knowing when each action a test makes is really complete. There are so many actions that a test can do in Kibana, and the page_object methods or services we use to do them should wait until Kibana is finished responding to that action. I know several people have implemented, or are trying to implement, ways to waitForRenderingComplete and that type of thing.
I haven't seen anything in a new test runner that resolves this issue. Am I missing something?

@LeeDr (Contributor) commented Sep 11, 2018

It might help to discuss a concrete example of a flaky test issue in the current FTR and explain what some new test runner would do differently. I just merged a PR to fix some rbac tests where the test would:

  1. logout
  2. login as a different user with specific roles
  3. navigate to visualize/new

But the login method returned immediately after clicking Submit on the login page. What happens when you click that submit, and how do we know when it's done? If we don't wait for that, the test will try to navigate WHILE that submit action is still changing Kibana, and the navigation to visualize/new fails.

Sure, we can check if the navigation worked and retry it if it didn't. But that will create tests that are constantly retrying every action 👎 I'd much rather fix the login method to wait for some indication that the action is done so we have reliable actions. But that seems extremely hard to know in Kibana.
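
As a hedged sketch of the "wait for an indication that the action is done" idea (the selector names and the retry usage below are assumptions, not the actual login implementation):

    // Sketch only: don't return from login() until Kibana shows something
    // that can only exist after the login has taken effect. The test subjects
    // 'loginUsername', 'loginPassword', 'loginSubmit', and 'kibanaChrome'
    // are hypothetical names.
    async function login(username, password) {
      await testSubjects.setValue('loginUsername', username);
      await testSubjects.setValue('loginPassword', password);
      await testSubjects.click('loginSubmit');
      // wait for a post-login indicator instead of returning immediately
      await retry.try(async () => {
        if (!(await testSubjects.exists('kibanaChrome'))) {
          throw new Error('still waiting for login to complete');
        }
      });
    }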

@stacey-gammon (Contributor, Author)

There are multiple issues going on imo.

One is that we often use the testSubjects.exists return value to determine our next move in a test. For instance:

    async onDashboardLandingPage() {
      log.debug(`onDashboardLandingPage`);
      const exists = await testSubjects.exists('dashboardLandingPage');
      return exists;
    }

This is flaky if network speed is slow. The exists check only waits a max of 1 second, so it might return false when the dashboard landing page is just busy loading. We have to stop using .exists to make decisions and start knowing where we are and asserting the next steps, using existsOrFail or missingOrFail. Dashboard tests used the above as a way to simplify tests: gotoDashboardLandingPage can be called whether or not you are on the landing page, and it'll direct you to the landing page. But because of the above, this is just going to get us into trouble. So each test should know exactly where it is and assert as it goes along.
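
A hedged sketch of the "know where you are and assert it" pattern; the navigation helper and the timeout option are assumptions rather than the current API:

    // Sketch: navigate explicitly and assert arrival with existsOrFail (which
    // retries internally and fails loudly) instead of branching on a 1-second
    // exists() check.
    async function gotoDashboardLandingPage() {
      log.debug('gotoDashboardLandingPage');
      await PageObjects.common.navigateToApp('dashboard');
      await testSubjects.existsOrFail('dashboardLandingPage', { timeout: 20000 });
    }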

Another issue I ran across is clicking buttons while they are still loading. Leadfoot executes the click function without complaint, but the click might not go through. This manifests as some failures with the save dialog (and is easy to repro when network speed is slow and saving the dashboard takes a while). Rather than implementing the fix at the point of the save-dashboard call, I tried baking it into the testSubjects service, so any other places that make this mistake will get fixed as well: https://github.com/elastic/kibana/pull/22901/files#diff-6238cdd3f41d65e4fddb017713976292R190. The problem is this changed the timing, so now I am seeing other failures show up more frequently in that PR.
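
Roughly, the idea baked into the testSubjects service might look like the following sketch (the find/isDisplayed/isEnabled helpers are assumptions about the service internals):

    // Sketch: retry the click until the element is actually displayed and
    // enabled, so clicks on still-loading buttons aren't silently dropped.
    async function clickWhenReady(subject) {
      await retry.try(async () => {
        const element = await testSubjects.find(subject);
        if (!(await element.isDisplayed()) || !(await element.isEnabled())) {
          throw new Error(`${subject} is not ready to be clicked yet`);
        }
        await element.click();
      });
    }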

Another issue is the waitForRendering issue. This is a legit issue in our code that manifests as flaky tests, since we deal with this in reporting. Different implementations of visualizations can have bugs in this - like the timelion one. So each implementation needs to be looked at in isolation - it's hard to fix them all in one go.

The above is specific to embeddable rendering, but the tests have their own issues (not related to reporting at all). For instance, if you click on a row while an EUI table is still loading, you might not click the actual row. Another instance where I ran across this was with input controls, but that was actually a bug in the code - a different menu flickers on the screen for a sec, but is quickly replaced by another menu. Again, I think each of these situations needs to be handled individually. Table row clicks need to wait for the table loading indicator to be gone. Interactions with flyouts need to wait for the flyout to finish sliding over (or have no animation).

I have been trying to fix a few of the above in #22901 but am hitting another flaky issue. This is the problem I am encountering - with so much flakiness and so much fragility, it's hard to isolate and fix a single problem because the whole test suite then crumbles and more flakiness is exposed! If a PR fixes the flakiness of 1 test but via a side effect exposes the flakiness of 5 more tests... how is that PR supposed to make progress and get checked in? Do we skip all the other ones? When will we find time to delve further into them?

Or, we write a new FTR, get all the extra benefits, and can isolate and focus on each of these issues with flaky tests.

@cuff-links (Contributor) commented Sep 11, 2018

> Here's my concern. I think almost all the flakiness in our UI tests is caused by not knowing when each action a test makes is really complete. There are so many actions that a test can do in Kibana, and the page_object methods or services we use to do them should wait until Kibana is finished responding to that action. I know several people have implemented, or are trying to implement, ways to waitForRenderingComplete and that type of thing.
> I haven't seen anything in a new test runner that resolves this issue. Am I missing something?

What is it that the waitForRenderComplete methods do? In my experience working with Selenium, there are not too many things that I have found you can't verify unless you just have no DOM access. You can wait for literally any condition as long as it relates to something within the DOM. Whether it be a class, a CSS value, a text value, etc., you can have Selenium wait for that. So if you have a visual, textual, or other way that lets you know the render is complete, typically you can wait for that. You would verify it the same way a user would verify that the software is ready before they continue. The difference is that you do not return from that login method until that wait condition is satisfied.
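
For example, a plain selenium-webdriver explicit wait on a DOM condition might look like this sketch (the data-test-subj selector is a made-up placeholder):

    // Sketch: block until the DOM reaches the desired state (up to a timeout)
    // before the page object method returns.
    const { By, until } = require('selenium-webdriver');

    async function waitForUserToBeLoggedIn(driver, username) {
      // wait for an element that only appears once the new user is logged in
      await driver.wait(
        until.elementLocated(By.css(`[data-test-subj="loggedInUser-${username}"]`)),
        30000
      );
    }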

@LeeDr In the case of your scenario where you log out as one user and log back in as another, what is the indicator that the new user has logged in?

@stacey-gammon (Contributor, Author) commented Sep 11, 2018

> explain what some new test runner would do differently.

I don't think it's so much that the new FTR is going to magically fix flakiness, but that it allows us to work on each flaky test issue in isolation and ensure that the tests in it are all much more stable (esp. if we formalize an easy way to check for flakiness when adding a test - as opposed to the current method of just running retest 10 times, or copy/pasting the loadTestFile('testSuite') line 30 times). Maybe integrate a way with GitHub to tell Jenkins to test the PR with slow network latency, e.g. jenkins test this slow.
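
A hypothetical sketch of what baking in a "run this suite x times" check could look like in a test index file (the TEST_REPEAT environment variable is an invented knob, not an existing flag):

    // Sketch only: repeat a suite N times in one CI run to smoke out flakiness,
    // instead of hand-copying the loadTestFile line.
    export default function ({ loadTestFile }) {
      const repeat = parseInt(process.env.TEST_REPEAT || '1', 10);

      describe('dashboard app (flakiness check)', function () {
        for (let i = 0; i < repeat; i++) {
          loadTestFile(require.resolve('./_dashboard_filtering'));
        }
      });
    }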

> What is it that the waitForRenderComplete methods do?

We can't use normal Selenium DOM checks for this case because embeddables and visualizations are pluggable, and reporting needs the concrete implementations to tell it when they are done rendering - we have no idea what DOM elements they will be loading internally, so there isn't anything we can specifically check for across the board. Well, not unless we told embeddables to add a certain DOM element when they are done rendering. We do something similar, but we tell them to alert us via the data-render-complete="true" attribute, which is what the tests and reporting look for. For instance, this is what the dashboard snapshots look for to know when to take the screenshot. Alternatively the test could check for something we know an area chart draws when it's done, but that is kind of cheating. :) At least, reporting can't use that, so we really need a general approach that works and that we can use in our tests, which also helps test that reporting works.

@cuff-links (Contributor) commented Sep 11, 2018

@stacey-gammon If there is a reliable way to ensure that the data-render-complete="true" does not show up prematurely, there is no reason that we couldn't just wait for that css attribute to show up.
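
A hedged sketch of that in FTR terms - the data-shared-item selector and the find/retry services are assumptions about how the panels could be counted:

    // Sketch: wait until every embeddable panel on the page reports
    // data-render-complete="true" before screenshotting or asserting.
    async function waitForAllPanelsToRender() {
      await retry.try(async () => {
        const panels = await find.allByCssSelector('[data-shared-item]');
        const done = await find.allByCssSelector(
          '[data-shared-item][data-render-complete="true"]'
        );
        if (done.length < panels.length) {
          throw new Error(
            `only ${done.length} of ${panels.length} panels have finished rendering`
          );
        }
      });
    }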

@cjcenizal (Contributor) commented Sep 12, 2018

I'm also +1 for implementing a new FTR and migrating over a small chunk of tests at a time.

I also experienced the same frustrations that @stacey-gammon did, and they're the reason I wasn't able to merge what should have been a reasonable improvement to our tests (#21302). All I wanted to do was wait for a toast to appear and then dismiss it. However, additional tests kept breaking, some of them flaky and some of them not, and all of them difficult to analyze. I don't think a new FTR will magically fix this, but a new FTR based on commonly-used tools and patterns will allow me to have greater understanding of our test codebase, greater confidence in the code, and empower engineers like @silne30 and @liza-mae to support me better (this is an important point!).

I'm reassured by the fact that we already have evidence that this approach works. We've migrated large portions of our UI from Angular to React and simultaneously updated our look-and-feel by committing to a long-term vision, providing tools and patterns the entire team can follow (EUI and Jest test guidelines), and then letting the team convert code over one small chunk at a time. As long as we give ownership over this process to a few individuals, have them invest heavily in fleshing out the vision up front, and then move as quickly as possible to open it up to the rest of the team, I think we'll be successful.

@stacey-gammon (Contributor, Author)

> If there is a reliable way to ensure that the data-render-complete="true" does not show up prematurely,

That is the crux of the issue. Depending on the specific visualization implementation, some are solid, some are iffy. There is an issue with the timelion implementation. There is also an issue with most implementations that use D3 (but they are different issues); see #22581 and #22601 for background.

But that is really separate from the FTR and is something we need to fix in code. There is no reason to hold up all tests because of this issue; only a subset of tests are affected. Some can use specific knowledge to wait for render (e.g. a pie slice on a pie). Also, I don't think this specific bug has manifested as actual flakiness on CI. We weren't even aware of it being an issue until the reporting implementation changed some timing. The _embeddable_rendering test has never exposed an issue with it in CI, nor have the dashboard snapshot tests. Not saying that it couldn't become flaky because of timing - it obviously can - just saying that it def isn't the sole reason for all our test flakiness.

@epixa (Contributor) commented Sep 12, 2018

It seems like the most significant issue that this proposal seeks to address is that it is difficult to fix any particular flaky test because so many other tests are flaky. Something like simulated network latency can be hugely useful here, but it's impractical to use because it breaks so many tests.

We'd like to grab tests one by one, separate them out from all the other unfixed tests, fix them, and then have them run in a "stable" place eventually replacing all of our existing tests.

Unless I'm missing something, we don't need an entirely new testing framework and set of tools to address this problem. We can create a new CLI task, selenium job, etc. using the existing tool set and then start moving tests over one by one, fixing them along the way. Then you can try slow browser connections and whatnot without worrying about it impacting a ton of tests that you can't focus on yet.

None of the concerns raised about knowing when things are loaded are test framework concerns, so using a different test framework won't solve them.

This doesn't mean we can't build a new test framework, because there are definitely other benefits here, but doing so will take a lot of person time, of which we have very little to spare at the moment. And it seems to be a nice-to-have rather than a critical component of fixing our tests. It would even be easier to build and roll out if we had already fixed the test flakiness issues.

@chrisronline (Contributor)

I might have missed this, but how often are we fixing a test because of an actual code issue instead of an issue with the test environment? I'm trying to understand the relative value of integration tests as a whole and how much time we spend fighting the test environment versus the value it provides in terms of catching actual issues in our source code.

@stacey-gammon (Contributor, Author)

> Unless I'm missing something, we don't need an entirely new testing framework and set of tools to address this problem.

Agreed, we don't need a new FTR for this, and if flaky tests were the only reason a new FTR was being proposed, I'd say that's def not enough to justify the effort. It's just one benefit to having a new FTR.

> doing so will take a lot of person time, of which we have very little to spare at the moment.

Totally agreed, and if it would take a very long time to see any benefits from a new FTR, I'd also say, let's not do it. I'm bringing it up because it seems like in-place improvements are also going to take a long time, so it might actually take a shorter amount of time to get at least some tests running with cross-browser testing.

Even if we created a separate CLI task to isolate stable and flaky tests, we can't replace an entire back end if it creates way more instability. Unless we only swap out the back end for the tests being run in the "stable" portion. But then we'd have to support different tests running on different backends... and I ask again if it wouldn't just be better to bite the bullet and implement a new FTR as part of this effort.

@stacey-gammon (Contributor, Author)

> I'm trying to understand the relative value of integration tests as a whole and how much time we spend fighting the test environment versus the value it provides in terms of catching actual issues in our source code.

I've wasted innumerable hours fighting flaky tests, but you can still pry them from my cold, dead hands. They are invaluable given how tough it is to test some of our really wonky infrastructure, like AppState, Courier, and the time picker. There are rootScope.broadcasts still in the code. Can't unit test that!

The tests suck, and the flakiness sucks, but time and again they have demonstrated their usefulness in catching actual bugs. We need them given how fast we iterate, especially as we try to refactor the messy back end code. @ppisljar or @timroes -- I'm pretty sure you both have hit some unexpected test failures in your refactoring PRs that exposed legit bugs, is that right?

I was trying to turn the _dashboard_filtering tests back on... and found a legit bug. The rendering test flakiness? A legit bug in timelion.

@cjcenizal (Contributor)

@epixa I think flaky tests are only one part of the picture. I think the original intention behind this issue is to figure out the best process for replacing a hand-rolled, legacy system and layers of tech debt with a modern system using conventional tools and consistent patterns. That perspective assumes that our goal is to replace the current FTR; the question is how. If anyone wants to revisit and validate that assumption, @silne30 has put together some materials that justify it (in my opinion), including an analysis of our current system's problems, feedback from other test engineers/architects, and a POC of what a new system could look like.

@nreese (Contributor) commented Sep 12, 2018

> Unless I'm missing something, we don't need an entirely new testing framework and set of tools to address this problem. We can create a new CLI task, selenium job, etc. using the existing tool set and then start moving tests over one by one, fixing them along the way. Then you can try slow browser connections and whatnot without worrying about it impacting a ton of tests that you can't focus on yet.

If we are going to go through the effort of migrating tests one at a time to a new location to fix flakiness, why not use that effort to also migrate the tests to a new FTR? This would avoid having to migrate all of the tests to the new FTR at a later time to enable multi-browser support.

@LeeDr (Contributor) commented Sep 12, 2018

I have this PR to split tests into much smaller jobs, so CI might use 15 Jenkins workers but get the tests done in less than 30 minutes: #22359
But that PR in itself won't make the tests less flaky. It'll just make CI complete quicker and the individual Jenkins console log files smaller. I didn't have any plan to separate tests by flakiness.

I think we're making decent progress on flaky tests. We're getting smarter about understanding the causes and coming up with more robust solutions.
There haven't been any functional failures on main branches so far today. I'd like to have us spend another 2 weeks or so fixing flaky tests and then re-evaluate the options here.

@liza-mae (Contributor)

I agree with a lot of what @cjcenizal is saying. For me, the two biggest pain points are the flaky tests and not having more standards/guidelines to follow. I am not sure I understand what role the FTR plays in flaky tests, so for me the FTR is not much of an issue right now. I do think we are making progress on flaky tests, but from what I see a lot of the higher-priority flaky tests have just been skipped, so more work is needed to fully understand the root cause of these failures.

@LeeDr added the Team:QA label on Oct 10, 2018
@LeeDr (Contributor) commented Dec 11, 2018

@spalger created some wrapper services around leadfoot #26406 which makes it easier to replace leadfoot with WebDriver and not have to rewrite all the tests (there will still be some changes).

@silne30 and @dmlemeshko are now working on that WebDriver implementation: #26477
They also have plans for documentation and improvements, like replacing all the retry loops with WebDriver waitFor calls.
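
As a rough illustration of that direction (the wrapper and selector names are assumptions), a hand-rolled retry loop polling for an element could collapse into a single WebDriver wait:

    // Before (sketch): a custom retry loop polling for a test subject.
    async function waitForLoginForm() {
      await retry.try(async () => {
        if (!(await testSubjects.exists('loginForm'))) {
          throw new Error('login form is not visible yet');
        }
      });
    }

    // After (sketch): lean on WebDriver's built-in explicit wait instead.
    const { By, until } = require('selenium-webdriver');

    async function waitForLoginFormWithWebDriver(driver) {
      await driver.wait(
        until.elementLocated(By.css('[data-test-subj="loginForm"]')),
        10000
      );
    }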

Is this issue still serving a purpose now?

@cuff-links (Contributor)

This particular issue isn't really needed anymore. We have others tracking progress and we have already decided to do an in-place replacement. We can close this.
