
[test] Remove karma #16263

Closed
wants to merge 1 commit into from

Conversation

@eps1lon (Member) commented Jun 17, 2019

test_browser looks nice in theory, but in practice we don't get any value from it. Even Facebook does not run unit tests in the browser (hence why Jest isn't supported in the browser). If a website with over a billion users is still functional without unit tests in the browser, I'd say we can live without them as well.

Removing it reduces the number of false-positive failures, cuts CI runtime, and leaves us with less code to maintain. All of this should improve productivity. The commit messages in the evaluated PRs speak for themselves ("fix test (hopefully)" or "give up on test").

When does a test suite provide value?

The test suite has value if it finds bugs that other tests didn't find.

Methodology:

  • assumes test_browser failures are not fixed locally
  • for each PR, check the first 100 commits and see if test_unit succeeded while test_browser failed
  • repeat until the GitHub API rejects further requests because of abuse detection (a sketch of this check follows the list)
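
A minimal sketch of that check, assuming Node 18+ for the global fetch and assuming CircleCI reports commit statuses under the contexts `ci/circleci: test_unit` and `ci/circleci: test_browser` (both assumptions, not verified against this repo):

```js
const REPO = 'https://api.github.com/repos/mui-org/material-ui';

async function getJson(url) {
  const res = await fetch(url);
  if (res.status === 403) {
    // GitHub's abuse detection / rate limiting kicks in here
    throw new Error('GitHub API rejected the request');
  }
  return res.json();
}

// true if any of the PR's first 100 commits has test_unit passing
// while test_browser failed (100 is the API's per_page maximum)
async function hasDivergentBuild(prNumber) {
  const commits = await getJson(`${REPO}/pulls/${prNumber}/commits?per_page=100`);
  for (const { sha } of commits) {
    const { statuses } = await getJson(`${REPO}/commits/${sha}/status`);
    const byContext = Object.fromEntries(statuses.map((s) => [s.context, s.state]));
    if (
      byContext['ci/circleci: test_unit'] === 'success' &&
      byContext['ci/circleci: test_browser'] === 'failure'
    ) {
      return true;
    }
  }
  return false;
}
```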

Results

In the last 6,000 PRs, 129 had different results for test_browser and test_unit. I evaluated the 17 most recent of those 129 and found only 1 actual issue. Most were either flaky (a restart fixed them) or the test needed to be implemented differently. I can evaluate more if the removal is contested.

Actual issue (react-testing-library):
react-testing-library was introduced after this PR. Can't use this as a pro for karma if you're against rtl.

Actual issue:

Indeterminate (e.g. force-push?)

Fixed by changing test:

Test skipped:

Flaky:

Not evaluated (yet; in those PRs docs:api is still part of test_browser):

Notes

With our methodology we will not find all builds with a flaky test_browser because they are usually restarted manually from the CircleCI dashboard. Unless we can query failed CircleCI builds, we have to sift through PRs.
It also does not catch cancelled test_browser jobs where test:karma failed but the job couldn't finish (should be very rare).

/cc @mui-org/core-contributors

Future work

Actual interactions should be tested in the browsers (e.g. keyboard a11y). I'm looking into cypress to test these more advanced scenarios.

TODO:

  • remove test_browser from required GitHub checks

@eps1lon added the test label Jun 17, 2019
@oliviertassinari (Member) commented Jun 17, 2019

(On a side note, browser support for Jest is the most requested feature of the project.)

@eps1lon force-pushed the test/remove-karma branch from 5987776 to 19dec9e on June 17, 2019 at 11:44
@mui-pr-bot

No bundle size changes comparing 4909bd8...19dec9e

Generated by 🚫 dangerJS against 19dec9e

@oliviertassinari (Member) commented Jun 17, 2019

@eps1lon Thanks for the Facebook link. When I read this section:

To cover browser testing we do a lot of things ranging from web driver tests, manual QA and especially employee dogfooding. These things have in general provided more value than a flaky browser test engine that is hard to scale for thousands of tests.

I'm wondering, how will we cover browser testing? I could think of adding Sentry to the documentation website, but most of our traffic comes from Chrome. Ultimately it will be GitHub issues and release reactivity on our part?

Actual interactions should be tested in the browsers (e.g. keyboard a11y). I'm looking into cypress to test these more advanced scenarios.

I'm confused. Aren't the interaction tests the flakiest ones? So is this about replacing karma with cypress, then?


If the pain point is flaky tests, I can think of a simple approach: automatic retries. We were facing this problem at Doctolib: hundreds of tests, a lot of flaky ones. At first we tried to make the tests less flaky, but it was very time-consuming. We fixed quite a few flaky tests. But at some point, we introduced auto-retry logic (while monitoring the tests that were retried most often). It was exactly the solution we needed at Doctolib. Is this something we can reproduce here?

I only know of two cases of flaky browser tests; we could apply the following fixes (a retry sketch follows the list):

  • The BrowserStack tunnel connection is interrupted (1. add retry logic, or 2. remove the BrowserStack browsers and only keep headless Chrome).
  • The SwipeableDrawer test fails (1. investigate why, or 2. add retry logic).
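
For reference, a minimal sketch of what that retry logic could look like on the current stack. The karma keys are real config options; the thresholds are made-up examples:

```js
// karma.conf.js (sketch): tolerate dropped BrowserStack tunnel connections
// by re-launching the browser instead of failing the build outright.
module.exports = function (config) {
  config.set({
    browserDisconnectTolerance: 2, // re-launch a disconnected browser up to 2 times
    browserDisconnectTimeout: 60000, // ms before a silent browser counts as disconnected
    browserNoActivityTimeout: 120000, // ms without any message before giving up
  });
};

// For a single known-flaky suite, mocha's built-in retries:
describe('<SwipeableDrawer />', function () {
  this.retries(2); // re-run a failing test up to 2 extra times
  // ...
});
```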

@oliviertassinari (Member)

cc @nathanmarks (who first introduced karma)

@eps1lon (Member, Author) commented Jun 18, 2019

The whole point of all this data is that we don't do browser integration tests at the moment. karma does not test anything that jsdom doesn't test. Yes, I am working on tests with cypress, but those would be new tests. If your takeaway here is that we lose some testing, then you haven't looked at the data.

@joshwooding (Member)

I think I have been helped by karma twice; most of the time it failed due to different CSS syntax or flaky tests. At the very least we could trial not having the browser tests. Nothing stops us from adding them back if we notice a regression.

@eps1lon (Member, Author) commented Jun 18, 2019

Yeah, when I say it doesn't do anything, I mean the number of true positives is near zero. Compared with the high number of false positives, it really hurts productivity.

@oliviertassinari (Member) commented Jun 18, 2019

If your takeaway here is that we lose some testing then you haven't looked at the data

I'm not challenging what we will gain. I'm challenging whether cypress will be better than some of the other options we can try at a cheaper cost:

  • (BrowserStack flakiness) remove BrowserStack to keep only headless Chrome, or add retry logic for BrowserStack
  • (minimizing the number of tools) we already write integration tests with enzyme; in the future we can write them with react-testing-library. Unit and integration tests could be solved with the same tools.
  • (integration tests are flaky) cypress tests will be flaky too; we will need retry logic no matter what?

@dmtrKovalenko works for Cypress; I'm sure he can provide his perspective (biased, of course :).

@oliviertassinari (Member)

In my experience solving flaky tests at Doctolib, the problems came 50% of the time from incorrectly written tests and 50% from wrong core logic. Maybe investigating the flaky tests would be a valid direction too? I saw one recently with a Menu timeout.

@eps1lon (Member, Author) commented Jun 18, 2019

we already write integration tests with enzyme, in the future we can write them with react-testing-library. Unit and integration tests could be solved with the same tools.

I need to write it down at some point that this distinction makes no sense for a component library. Every component is a unit. There is not a single integration test here. Just unit tests with different APIs.

we already write integration tests with enzyme, in the future we can write them with react-testing-library.

The inherent issue with these is that they are designed for jsdom, not an actual DOM. They are not fit for browser tests because in a browser environment we don't mock events; we use the browser APIs for that.

cypress tests will be flaky too, we will need a retry logic no matter what?

Unsourced, and therefore ignored in this argument.

@eps1lon marked this pull request as ready for review June 18, 2019 11:57
@eps1lon (Member, Author) commented Jun 18, 2019

options we can try at a cheaper cost:

Also: all of these still have karma in mind, so the original statement still seems not to be accepted: karma's value is dwarfed by the reduction in productivity. The cost is multiple orders of magnitude higher than the value gained. That is what I'm expressing in this PR, not that there are other options we can explore. If we don't acknowledge this issue, there is no point in exploring other solutions.

@dmtrKovalenko (Member)

I work for Cypress precisely because it is the only tool (really, the only one) that allows non-flaky tests in the browser.
You know I am trying to integrate cypress for material-ui-pickers. It works well, but there are still some troubles setting up the test environment (right now I'm using a separate page for that: https://material-ui-pickers.dev/regression). I want to try https://github.com/bahmutov/cypress-react-unit-test by @bahmutov.

In any case, Cypress is a really good choice, especially because of the entirely free plan for open source, with unlimited parallelization and test recordings included.

@oliviertassinari (Member)

@dmtrKovalenko Thanks for the feedback!

@oliviertassinari (Member) commented Jun 18, 2019

Coming back to the pain points mentioned in the description of the pull request:

Removing it reduces the number of false positive failures

If I read your methodology correctly, we have a flakiness rate of over 2%. We would like to go below 1%. If we agree, then I think we should categorize where these flaky tests come from. With this data, we can prioritize. If 80% of the flakiness comes from BrowserStack, we can kill it: problem solved. If 80% of the flakiness comes from the Menu or SwipeableDrawer tests, we can auto-retry them: problem solved.

CI runtime in general

I believe that removing BrowserStack would yield 80% of the value. At Doctolib, we killed it for this reason. It was initially introduced for screen recording, but that turned out to be useless.

and just less code we have to maintain.

Will cypress be less code to maintain?

@eps1lon (Member, Author) commented Jun 18, 2019

If I read your methodology correctly, we have a flakiness rate of +2%.

50% of the evaluated test failures were caused by flaky tests.

If 80% of the flakiness comes from browserstack. We can kill it, problem solved.

And what is the purpose of karma at that point? It would just be running the tests twice.

I believe that removing browser stack would yield 80% of the value.

No data, no rationale provided. Also not part of this argument.

We can kill it, problem solved. If 80% if the flakiness comes from the Menu or SwipeableDrawer tests, we can auto retry them. Problem solved.

It will further increase CI times, only worsening the problem. It's not as easy as you think it is.

Will cypress be less code to maintain?

It's starting again. This is not in the scope of this PR. Please don't move the goalposts; stay on topic. I urge you again: don't put your opinion on a pedestal and expect everyone else to come to you. What actual value do the karma tests provide? Not just your beliefs or feelings, but actual hard data. Please do the extra work instead of just discrediting the work of others.

@oliviertassinari (Member) commented Jun 18, 2019

50% of the evaluated test failures were caused by flaky test.

So over the last 6000 pull requests, we had 129 failures, and 50% of the failures were because of flakiness. So the current "flakiness rate" is 129 / 2 / 6000 = 1.075%. What number should we target? Is 1% a problem?

And what is the purpose of karma at that point? It's just running tests twice at that point.
What actual value does the karma test provide?

It provides the value that you have highlighted in the pull request description: "Actual interactions should be tested in the browsers" (Chrome headless, thanks to karma).

No data, no rationale provided. Also not part of this argument.

Let's break down the "Tests real browsers" task. It takes 20s to build the assets. The ChromeHeadless run takes about 10s. The BrowserStack tests take about 1 minute. Assuming another 10s of miscellaneous CPU work, the "Tests real browsers" task takes about 20 + 60 + 10 = 90s.
Now, if we remove BrowserStack, it should be 20 + 10 + 10 = 40s. We save 50s, a 2.25x speedup. Can we consider 1:30 slow?

It will further reduce CI times only worsening the problem. It's not as easy as you think it is.

Will it increase CI times only 1% of the time? Why will it worsen the problem?

This is not the scope of the PR.

I think that we should have a holistic view of the problems we solve. I believe that how we test interactions in the browser is key to the issue; hence we should consider Cypress as part of the solution.
We are navigating the space of available tradeoffs, maximizing these dimensions:

  • Fast CI feedback loop
  • Low flakiness rate
  • Low maintainer overhead (long term)
  • Low cost of change (short term)
  • Avoid regression
  • Maximize test coverage

As long as we carefully consider them, I'm happy.

@oliviertassinari (Member) commented Jun 18, 2019

I would also add Jest as part of the equation. I believe that Karma is preventing us from using it. It's worth mentioning.

@eps1lon (Member, Author) commented Jun 20, 2019

Alright, a different proposal:

jsdom for "unit" tests, i.e. the tests that live next to the component; karma for integration tests where jsdom is insufficient. It was a common pattern I observed when evaluating the failures: some assertion was made based on jsdom implementation details that don't hold up in the browser.

Basically, let `test:unit` only run `src/**/*.test.js` and let karma run `test/**/*.test.js`. We kill test duplication between jsdom and the actual DOM and avoid writing tests for jsdom.
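
A minimal sketch of that split, assuming our current mocha + karma setup (the exact globs and script names are illustrative):

```js
// package.json scripts (sketch):
//   "test:unit": "mocha 'src/**/*.test.js'"  -> jsdom tests living next to the components
//   "test:karma": "karma start"              -> browser tests living under test/

// karma.conf.js (sketch): only load the suites that need a real DOM
module.exports = function (config) {
  config.set({
    frameworks: ['mocha'],
    files: [{ pattern: 'test/**/*.test.js', watched: false }],
    // ...existing preprocessors, reporters, and launchers stay as they are
  });
};
```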

@oliviertassinari (Member)

@eps1lon Do you have an example of this distinction between a unit test and an integration test?

@dmtrKovalenko (Member)

I am sorry for getting into your discussion, but I am sure that for a component library, any test that runs in the browser will be end-to-end. Anything else is integration or unit.

This is kind of a philosophical theory, the distinction between test types :D

@eps1lon (Member, Author) commented Jun 20, 2019

@eps1lon Do you have an example of this distinction between a unit and integration test?

I'm not a fan of this distinction for a component library, because components are already a unit (or the whole app, if you think of it in terms of public vs. private).

My main concern is that anything that would heavily rely on jsdom implementation details should only run in the browser. Especially anything related to event dispatching tends to be very far removed from what's actually happening in the browser.

I'd probably go with: render¹ output => jsdom tests, interactions => browser tests. And for browser tests you can make some trade-offs if you don't have the infra for actual e2e tests, e.g. dispatch click or keydown events.

¹ Basically markup only. Styles support is very limited in jsdom.

any test that runs in the browser will be end-to-end

As soon as you're using imperative DOM APIs to act, you're not doing e2e tests. The user isn't using the DOM API but the browser. For example, anytime you write a test where you dispatch a change event, you no longer have a full picture of what's happening. In an actual e2e test you would click into the text field and start typing.
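
To illustrate the difference, a sketch (the Cypress half assumes a plain text field; the selector is illustrative):

```js
// Imperative DOM-API style: synthesizes a change event, bypassing focus and
// the keydown/keypress/input sequence a real user would trigger.
const input = document.querySelector('input');
input.value = 'hello';
input.dispatchEvent(new Event('change', { bubbles: true }));

// e2e style with Cypress: the runner clicks into the field and types,
// exercising the full native event sequence.
cy.get('input').click().type('hello');
```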

@oliviertassinari (Member) commented Jun 21, 2019

From my perspective, we shouldn't make an empirical distinction between when a test should run in a browser or not. I think the simplest option is to run everything in the browser. Our 2,000+ tests run in 10s in Chrome. That's nothing; we had 65 minutes of e2e tests at Doctolib, tests we had to split between 15 parallel workers.

@eps1lon (Member, Author) commented Jun 21, 2019

we shouldn't make an empirical distinction between when a test should run in a browser or not.

We could do that, but we still only have jsdom. Until something better comes along, what's important in the browser will always be brittle; the rest is redundant.

Still missing: an argument for browser tests. It has never been made, theoretically or empirically.

@oliviertassinari (Member) commented Jun 22, 2019

@eps1lon There shouldn't be any argument for browser tests: we need them, it's not optional. The argument should be about whether jsdom is good enough to approximate browser tests. Not always, but is the tradeoff worth it?

From my perspective, running the browser tests is low-hanging fruit. It doesn't cost much (<1% flaky-test rate right now, plus the potential to improve it with retries or by dropping BrowserStack).
We can't pick the same tradeoff as Facebook without considering the context. We don't have the same canary release control they have. They can deploy to 0.01% of the population, get instant bug feedback, and revert. They have 30k employees who can report problems on unreleased versions. We can't. They have significantly more tests than we have (x100-x1000?).
There is also a psychological effect: users will feel safer seeing that the library they depend on has tests running in browsers. The perceived value is as important as the intrinsic value.

@eps1lon (Member, Author) commented Jun 22, 2019

There shouldn't be any argument for browser tests: we need them, it's not optional

Says who?

@oliviertassinari (Member) commented Jun 22, 2019

Says who?

Our users, using the components in a browser?

@joshwooding (Member)

I've been thinking about the browser test situation, and I think we should probably keep the browser tests. Style-specific assertions should probably only be tested in jsdom or removed, and the rest tested in actual browsers. As a UI library, it makes sense to have browser tests.

@oliviertassinari (Member)

There is probably a better test architecture we could deploy. There is no question that by investing more time, we could improve our tests.
However, I believe that what we have now is good enough. I don't think that we should invest more time in it. Let's build more features instead.

@oliviertassinari (Member)

Thanks, guys, for the discussion!

@eps1lon (Member, Author) commented Jun 25, 2019

Thanks, guys, for the discussion!

There was none. There was evidence from me, and some claims from other people that weren't justified. A dangerous precedent, but maybe I can do the same.

@eps1lon deleted the test/remove-karma branch June 25, 2019 11:44