ClassificationIT testSetUpgradeMode_ExistingTaskGetsUnassigned Failure #55221
Comments
Pinging @elastic/ml-core (:ml)
I'm already working on it, reassigned to myself.
In the server logs it seems that changing the value of upgrade mode happens "concurrently" with creating
Other than that, there is no indication of upgrade mode-related failure in the logs. Still, the test fails as it cannot gather task stats.
Thanks for muting @mark-vieira, we were collecting extra debug information and were about to mute
Is there some kind of annotation that will still execute a test, but not trigger a failure? If not, perhaps it's worth adding such a thing?
Sounds like a good idea for all the "add more logging" cases. I'll try to find out if it's there.
It is a very interesting idea, as leaving the tests unmuted with extra debug logging is a common workflow. I took a look at
But without the build failure, how would I know that my flaky test had failed again? I would have to check the logs of every build for the assumption exception. There is a conflict between suppressing the failure (so the build succeeds) and being notified of the failure; I can't see a way to have both. Unless we track assumption failures, which would just complicate things, as certain tests assume they are running on certain platforms etc.
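For anyone following along, the mechanism being discussed is roughly what plain JUnit 4 assumptions provide: a failed assumption marks the test as skipped rather than failed, which is exactly the trade-off described in the comment above (the build stays green, but nobody is notified). This is an illustrative sketch only; the class and method names are made up and this is not code from the Elasticsearch test framework.

```java
import org.junit.Assume;
import org.junit.Test;

public class ExampleFlakyIT {

    @Test
    public void testFlakyScenario() {
        try {
            runScenarioUnderInvestigation();
        } catch (AssertionError e) {
            // Log the details so they can still be found later, then turn the
            // failure into a "skipped" result. The build stays green, but the
            // failure is only visible to someone who reads the test logs.
            System.err.println("Known flaky failure, details follow: " + e);
            Assume.assumeNoException("suppressing known flaky failure", e);
        }
    }

    private void runScenarioUnderInvestigation() {
        // placeholder for the real test body
    }
}
```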
That's a good question. This is a good use case for retrying tests. In this scenario we still get the failures, and we identify the test as being flaky if it passes on subsequent runs. There's first-class support for this in Gradle, and it's on my todo list to investigate how we might integrate it into the build. Right now we aggressively mute flaky tests to reduce noise, but this has the side effect of losing potentially useful data from the failures.
I also muted this in 7.x in 5de6ddf. A recent failure is https://gradle-enterprise.elastic.co/s/qffj2qikw45by, so hopefully that contains the extra debug that was added.
A possibly similar failure: https://gradle-enterprise.elastic.co/s/c7hhi3j24h7o6. I am going to mute ClassificationIT.testSetUpgradeMode_NewTaskDoesNotStart as well.
Investigating...
The failure in the logs is:
This particular test failure is irrelevant to this issue. The only reason it appears is that it expects no DFA tasks at some point and there still exists a task created by a different test (
This one is relevant though. I've found the additional debug message in the logs; it states that:
It's not very enlightening, but after looking at the code where the
I'm still figuring out why it's the upgrade mode tests that suffer from this, as there are other tests calling
Ok, I think I see the difference. All the other tests that use
So in order to unblock the upgrade mode test, I can adopt approach (2.), i.e. wrap the
The question still remains whether it is ok for the
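For illustration, here is a self-contained sketch of the "loop until success" pattern that the other tests rely on and that the next comment describes. The helper below is a hypothetical stand-in, not the Elasticsearch test utility itself (in the real tests this is what the test framework's `assertBusy` provides), and the stats request is only indicated by a comment.

```java
import java.util.concurrent.TimeUnit;

public final class RetryUntilSuccess {

    /**
     * Repeatedly runs the given assertion until it stops throwing or the
     * timeout expires, sleeping briefly between attempts. This mirrors the
     * busy-assert pattern used to tolerate a stats search that hits an index
     * whose shards have not finished initializing yet.
     */
    static void assertEventually(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError | RuntimeException e) {
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // give up and surface the last failure
                }
                TimeUnit.MILLISECONDS.sleep(500);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical usage: keep asking for stats until the stats index is searchable.
        assertEventually(() -> {
            // stats request and assertions would go here; throw if they fail
        }, TimeUnit.SECONDS.toMillis(30));
    }
}
```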
Since this is what all the other tests do, it's fair enough to fix the immediate problem in the same way. However, this isn't very user friendly, as it means a programmatic user of our APIs would have to do the same thing: loop until success if they wanted to get stats immediately after starting a job.
Some of our endpoints provide guarantees about what will work immediately afterwards if you are writing a program that calls a sequence of our endpoints one after the other. For example, with anomaly detection jobs, if you get results immediately after calling close or flush then you are guaranteed to get the results from all data submitted before the close or flush. I wonder if our start data frame analytics job and open anomaly detection job endpoints should provide the guarantee that you can ask for stats immediately after the start/open call returns and:
I think the problem of getting an error is probably more likely for data frame analytics jobs than anomaly detection jobs, because data frame analytics jobs create two indices on first start (stats and state) whereas anomaly detection jobs only create one (state). But theoretically the problem could affect both. So I think in the long term we should wait for yellow status on the indices we create as part of start/open before we set the state to started/opened.
Ok, I've just sent a PR for review that does just that.
Exactly. That's why I raised this...
Yes, and both indices are needed in the stats call.
I'm ok with that as long as the user is not surprised that their first stats call takes (much) longer than the other ones.
I think we should wait in the start/open call, not the stats call. The start/open call might just have created the indices, so there's a legitimate reason to think that if they're not yellow status they will be soon. But for the stats call, if we waited every time then we could be waiting for fatal problems that require administrator intervention to resolve.
Ah, yeah. Of course that would make much more sense.
There have been a few test failures that are likely caused by tests performing actions that use ML indices immediately after the actions that create those ML indices. Currently this can result in attempts to search the newly created index before its shards have initialized. This change makes the method that creates the internal ML indices that have been affected by this problem (state and stats) wait for the shards to be initialized before returning.
Fixes elastic#54887
Fixes elastic#55221
Fixes elastic#55807
Fixes elastic#57102
Fixes elastic#58841
Fixes elastic#59011
…#59027) There have been a few test failures that are likely caused by tests performing actions that use ML indices immediately after the actions that create those ML indices. Currently this can result in attempts to search the newly created index before its shards have initialized. This change makes the method that creates the internal ML indices that have been affected by this problem (state and stats) wait for the shards to be initialized before returning.
Fixes #54887
Fixes #55221
Fixes #55807
Fixes #57102
Fixes #58841
Fixes #59011
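For context, a rough sketch of what waiting for a newly created internal index to reach yellow health could look like with the cluster health API. This is illustrative only and not the code from the PR above; the class and method names come from the 7.x Java client packages, and the helper name is made up.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

final class WaitForShardsSketch {

    /**
     * After creating an internal index, wait until its primary shards are
     * active (yellow health) before telling the caller the index is ready.
     * This is the general idea described in the commit message above; the
     * actual implementation in the PR may differ.
     */
    static void waitForIndexReady(Client client, String index, ActionListener<Boolean> listener) {
        client.admin().cluster()
            .prepareHealth(index)
            .setWaitForYellowStatus()
            .setTimeout(TimeValue.timeValueSeconds(30))
            .execute(ActionListener.wrap(
                response -> listener.onResponse(response.isTimedOut() == false),
                listener::onFailure));
    }
}
```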
https://gradle-enterprise.elastic.co/s/75oonfpb3j3mw