[Discover] Unskip functional tests for field visualize buttons #62614

Conversation

kertal
Member

@kertal kertal commented Apr 6, 2020

Summary

This PR unskips the discover_spaces and discover_security functional tests. While the implementation of these tests was fine, they were flaky because the initial request for the given time range in Discover sometimes returned no data. As a result, no fields were displayed in the sidebar and no Visualize button was available.

This was solved by #64155, which fixed an issue in async search.

Here's the flaky test suite runner job showing it's no longer flaky:
https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/393/

fixes #60539
fixes #60535

@kertal kertal added Feature:Discover Discover Application release_note:skip Skip the PR/issue when compiling release notes labels Apr 6, 2020
@kertal kertal self-assigned this Apr 7, 2020
@kertal
Member Author

kertal commented Apr 10, 2020

@elasticmachine merge upstream

@kertal kertal marked this pull request as ready for review April 11, 2020 08:37
@kertal kertal requested review from a team April 11, 2020 08:39
Member

@legrego legrego left a comment

Thanks for unskipping these!

await PageObjects.discover.expectMissingFieldListItemVisualize('bytes');
await retry.try(async () => {
await setDiscoverTimeRange();
const hasNoResults = await PageObjects.discover.hasNoResults();

I would generally avoid checking that something does not exist, since that takes a timeout of 10 seconds or so, compared to checking for something that should exist, like the hit count. Don't change anything yet. I'm going to run these tests locally and see if I have a suggestion for a change.

The current code retries a couple of times to make sure it's not on the "no results" page (using a 2500ms timeout rather than the default one), which takes about 3 seconds:

[11:39:34.333150000] │ debg TestSubjects.exists(discoverNoResults)
[11:39:34.393075600] │ debg Find.existsByDisplayedByCssSelector('[data-test-subj="discoverNoResults"]') with timeout=2500
[11:39:34.704008100] │ debg --- retry.tryForTime error: [data-test-subj="discoverNoResults"] is not displayed
[11:39:35.236323000] │ debg --- retry.tryForTime failed again with the same message...
[11:39:35.761950200] │ debg --- retry.tryForTime failed again with the same message...
[11:39:36.293043400] │ debg --- retry.tryForTime failed again with the same message...
[11:39:36.829574800] │ debg --- retry.tryForTime failed again with the same message...
[11:39:37.354057600] │ debg TestSubjects.click(field-bytes)

whereas getting the hitCount and verifying it's > 0 takes about 0.2 seconds:

[11:54:10.954650700] │ debg TestSubjects.getVisibleText(discoverQueryHits)
[11:54:11.002496700] │ debg TestSubjects.find(discoverQueryHits)
[11:54:11.052031400] │ debg Find.findByCssSelector('[data-test-subj="discoverQueryHits"]') with timeout=10000
[11:54:11.113423700] │ debg TestSubjects.click(field-bytes)
@@ -187,8 +187,9 @@ export default function({ getPageObjects, getService }: FtrProviderContext) {
         await PageObjects.common.navigateToApp('discover');
         await retry.try(async () => {
           await setDiscoverTimeRange();
-          const hasNoResults = await PageObjects.discover.hasNoResults();
-          expect(hasNoResults).to.be(false);
+          const hitCount = await PageObjects.discover.getHitCount();
+          // eslint-disable-next-line radix
+          expect(parseInt(hitCount)).to.be.greaterThan(0);

           await PageObjects.discover.clickFieldListItem('bytes');
           await PageObjects.discover.expectMissingFieldListItemVisualize('bytes');
@@ -281,8 +282,10 @@ export default function({ getPageObjects, getService }: FtrProviderContext) {
         await PageObjects.common.navigateToApp('discover');
         await retry.try(async () => {
           await setDiscoverTimeRange();
-          const hasNoResults = await PageObjects.discover.hasNoResults();
-          expect(hasNoResults).to.be(false);
+          const hitCount = await PageObjects.discover.getHitCount();
+          // eslint-disable-next-line radix
+          expect(parseInt(hitCount)).to.be.greaterThan(0);
+
           await PageObjects.discover.clickFieldListItem('bytes');
           await PageObjects.discover.expectMissingFieldListItemVisualize('bytes');
         });
@@ -362,8 +365,9 @@ export default function({ getPageObjects, getService }: FtrProviderContext) {
         await PageObjects.common.navigateToApp('discover');
         await retry.try(async () => {
           await setDiscoverTimeRange();
-          const hasNoResults = await PageObjects.discover.hasNoResults();
-          expect(hasNoResults).to.be(false);
+          const hitCount = await PageObjects.discover.getHitCount();
+          // eslint-disable-next-line radix
+          expect(parseInt(hitCount)).to.be.greaterThan(0);

Sorry but I just realized another potential issue with this change.

In other Discover tests we've used a retry only around getting the hit count and comparing it to the expected value. We didn't include setting the time range in the retry because each time you set the timepicker it's going to reload the page, and it's the page loading we're waiting for with the retry.
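
The pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `makeRetry`, `DiscoverPage`, and `waitForHits` are hypothetical stand-ins written for this example, not the real FTR `retry` service or Discover page object. The point is the call order: the time range is set once, and only the hit count check is retried.

```typescript
// Hypothetical stand-in for the FTR retry service; only the shape of `try` matters here.
type Retry = { try<T>(block: () => Promise<T>): Promise<T> };

// Minimal retry helper (assumption: the real service retries until a timeout instead).
function makeRetry(maxAttempts: number, delayMs: number): Retry {
  return {
    async try<T>(block: () => Promise<T>): Promise<T> {
      let lastError: unknown;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          return await block();
        } catch (err) {
          lastError = err;
          await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        }
      }
      throw lastError;
    },
  };
}

// Hypothetical page object with just the two calls discussed in this thread.
interface DiscoverPage {
  setTimeRange(): Promise<void>; // reloads the page every time it is called
  getHitCount(): Promise<string>; // e.g. '14,004' once the data has loaded
}

async function waitForHits(page: DiscoverPage, retry: Retry): Promise<number> {
  // Set the time range once, outside the retry, so the page is not reloaded per attempt.
  await page.setTimeRange();

  // Retry only the part that waits for the response to arrive and render.
  return retry.try(async () => {
    const text = await page.getHitCount();
    const hitCount = parseInt(text.replace(/,/g, ''), 10);
    if (!(hitCount > 0)) {
      throw new Error(`expected hits, got "${text}"`);
    }
    return hitCount;
  });
}
```

With this shape, a slow response only costs extra `getHitCount()` polls; it never triggers another page reload.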

From the failing test issue you said
"the screenshot of the failed test is telling me, no data available, expand your time range. that's odd"

Did the screenshot show the expected start and end dates?

Member Author

Thanks @LeeDr, I'm back today and will provide feedback soon.

Member Author

About the screenshot, yes it's showing the defined time range, but no data:
image

Member Author

I've run a similar test suite in OSS to debug the issue; it wasn't flaky there:
https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/339/

Member Author

Dear @LeeDr, how should we proceed here?

Maybe switch to const hitCount = await PageObjects.discover.getHitCount(); since this fixes the test, and open a separate issue to investigate the flaky data fetching?

Thx!

I find it pretty concerning that the screenshot shows the correct dates in the timepicker and no results?!?! It could still just be a timing issue that the results just haven't come back in the response yet, but the pink loading bar isn't there either so that doesn't feel right.

I'm looking at the flaky-test-suite-runner output now....

FYI, here's an example of a test where we only put the getHitCount() in the retry because it's waiting for the response from Elasticsearch and for the page to load that data; https://github.com/elastic/kibana/blob/master/test/functional/apps/discover/_discover.js#L74

Member Author

@LeeDr I've adapted the code, moving setDiscoverTimeRange() out of retry.try. Now the test is flaky in the flaky test suite runner (1 of 44):
https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/367/

image

Can I search the logs on the server? In Jenkins it's hard to search the logs; it says no test failed.

/cc @lukasolson (any other thoughts on this?)

I had a couple of thoughts on debugging this while running the test locally.

We could turn on Elasticsearch slowlogs on both the logstash-* and .async-search indices. I don't see that we've done that in any existing tests yet. It's a per-index setting. Seems like it would have to be done after esArchiver.loadIfNeeded('logstash_functional');. But the slowlog only shows the query, not the response. So this might not help in debugging the issue.
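
As a rough sketch of what that per-index setting could look like, the snippet below builds the request bodies for the index settings API. The setting names are the standard search slowlog thresholds; the helper name `searchSlowlogSettings`, the `slowlogRequests` shape, and the `'0s'` threshold are assumptions made for this example, and whether the .async-search index accepts the setting has not been verified here.

```typescript
// Builds the body for `PUT /<index>/_settings`.
// A threshold of '0s' is assumed here so that every query is logged during the test run.
function searchSlowlogSettings(threshold: string = '0s'): Record<string, string> {
  return {
    'index.search.slowlog.threshold.query.debug': threshold,
    'index.search.slowlog.threshold.fetch.debug': threshold,
  };
}

// One request per index named in the comment above.
// These would be sent after esArchiver.loadIfNeeded('logstash_functional').
const slowlogRequests = ['logstash-*', '.async-search'].map((index) => ({
  method: 'PUT' as const,
  path: `/${index}/_settings`,
  body: searchSlowlogSettings(),
}));
```

As noted above, the slowlog records the query but not the response, so this would only help rule out a wrong query.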

Another thing you could try, if we fail to find the hit count or if we do find the "no results" page, is to open the inspector and capture the request and response. It could show that the query sent was wrong, or that the query was right and Elasticsearch didn't return the correct response, or that the correct response was returned and Discover didn't display it.

Or temporarily add debug logging to output the query and response to the Kibana log.

@LeeDr LeeDr left a comment

LGTM - I didn't pull the latest commits in this PR to run locally, but the changes are in line with what we've done in other tests (after setting the timepicker, use a retry loop to wait for results in Discover). And Jenkins passed.

@kertal kertal requested a review from a team as a code owner April 17, 2020 18:29
@@ -69,7 +69,7 @@ async function asyncSearch(
const path = encodeURI(request.id ? `/_async_search/${request.id}` : `/${index}/_async_search`);

// Wait up to 1s for the response to return
const query = toSnakeCase({ waitForCompletionTimeout: '1s', ...queryParams });
Contributor

I think that the issue with the tests should be resolved by retrying in the tests, not increasing the initial waitForCompletionTimeout. Isn't that so?

Member Author

My last commit was to test if an increase of the waitForCompletionTimeout solves the flakiness of the tests, it does:

https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/378/

So there are 2 approaches here to solve this: increase the timeout or retry the test.

Member Author

@lukasolson @lizozom The question is why the user (or, in this case, the test) gets the message that no results match the criteria. There are matching results here, but the search took longer than the waitForCompletionTimeout. Shouldn't it continue searching in that case with GET async search? If the system is slower for some reason, which is what's happening here, it shouldn't report that there are no results.
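
To make the expected behavior concrete, here is a minimal sketch of the polling flow, not Kibana's actual implementation. `Transport`, `searchUntilComplete`, `pollIntervalMs`, and `maxPolls` are hypothetical names for this example; the `id`, `is_running`, and `is_partial` fields and the `GET /_async_search/<id>` endpoint come from the Elasticsearch async search API.

```typescript
// Subset of the async search response used in this sketch.
interface AsyncSearchResponse {
  id?: string;
  is_running: boolean;
  is_partial: boolean;
  response: unknown;
}

// Hypothetical transport so the sketch stays independent of any HTTP client.
type Transport = (method: 'GET' | 'POST', path: string) => Promise<AsyncSearchResponse>;

async function searchUntilComplete(
  send: Transport,
  index: string,
  pollIntervalMs: number = 1000,
  maxPolls: number = 30
): Promise<AsyncSearchResponse> {
  // Submit the search and wait up to 1s for it to complete, as in the code under review.
  let result = await send('POST', `/${index}/_async_search?wait_for_completion_timeout=1s`);

  let polls = 0;
  // `is_running: true` means "not finished yet", not "no results".
  while (result.is_running) {
    if (!result.id) {
      throw new Error('search is still running but no id was returned');
    }
    if (polls >= maxPolls) {
      // Surface the timeout explicitly instead of showing an empty result.
      throw new Error(`async search ${result.id} did not complete after ${maxPolls} polls`);
    }
    polls++;
    await new Promise<void>((resolve) => setTimeout(resolve, pollIntervalMs));
    result = await send('GET', `/_async_search/${result.id}`);
  }
  return result;
}
```

The design choice illustrated here is that a search exceeding the initial wait is either polled to completion or fails with a visible error; it is never reported as an empty result.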

Member Author

@kertal kertal Apr 20, 2020

This is what I get when I start an expensive search in my 7.7 staging cluster, a wildcard search over 50 million records:

Screenshot 2020-04-20 at 11:55:28

It continued while I was writing this message, and I suddenly got the following screen. Shouldn't there be a message that it timed out?

Screenshot 2020-04-20 at 11:59:00

@LeeDr

LeeDr commented Apr 20, 2020

1 second seems like too short of a timeout if I understand the impact correctly. Everyone loves a fast search result. But I don't think a typical user would care too much if a query took 2 or 3 seconds. I don't think they would want to be bothered with a dialog they have to click every time a query takes more than a second.

I thought this mechanism was going to be around the default 30 second timeout mark or somewhere just short of that?

@kertal
Member Author

kertal commented Apr 20, 2020

> 1 second seems like too short of a timeout if I understand the impact correctly. Everyone loves a fast search result. But I don't think a typical user would care too much if a query took 2 or 3 seconds. I don't think they would want to be bothered with a dialog they have to click every time a query takes more than a second.
> I thought this mechanism was going to be around the default 30 second timeout mark or somewhere just short of that?

@LeeDr This popup wasn't displayed after a second; it was behaving correctly. However, when the popup disappeared, the "No results match your search criteria" screen was displayed, which is the same behavior I saw in the tests. In the sync search, when you run into a timeout, there's an error message:

Screenshot 2020-04-20 at 18:34:40

Async search timeouts seem to fail silently and are therefore much harder to debug.

@kertal
Member Author

kertal commented Apr 20, 2020

@lukasolson @lizozom I could reproduce that behavior in a cluster with a large data set and an expensive query. I think we should increase waitForCompletionTimeout.

image

@LeeDr

LeeDr commented Apr 20, 2020

  • Let's make sure we run this test against Cloud before merging so we don't end up adding a flaky test there. I can help.

@kibanamachine
Contributor

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Labels
Feature:Discover Discover Application Feature:Functional Testing release_note:skip Skip the PR/issue when compiling release notes v7.7.0 v7.8.0 v8.0.0
Projects
None yet
6 participants