
ClassificationIT testSetUpgradeMode_ExistingTaskGetsUnassigned Failure #55221

Closed
davidkyle opened this issue Apr 15, 2020 · 19 comments · Fixed by #59027
Labels
:ml Machine learning >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments

@davidkyle
Member

https://gradle-enterprise.elastic.co/s/75oonfpb3j3mw

org.elasticsearch.ElasticsearchException: all shards failed
at __randomizedtesting.SeedInfo.seed([B5428DD8494405BC:568FC3C8087049D8]:0)
at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:55)
at org.elasticsearch.xpack.ml.action.TransportGetDataFrameAnalyticsStatsAction.lambda$searchStats$8(TransportGetDataFrameAnalyticsStatsAction.java:216)
./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testSetUpgradeMode_ExistingTaskGetsUnassigned" -Dtests.seed=B5428DD8494405BC -Dtests.security.manager=true -Dtests.locale=es-AR -Dtests.timezone=America/Argentina/Ushuaia -Dcompiler.java=14
@davidkyle davidkyle added >test-failure Triaged test failures from CI :ml Machine learning v8.0.0 labels Apr 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@przemekwitek przemekwitek self-assigned this Apr 15, 2020
@przemekwitek
Contributor

I'm already working on it, so I've reassigned it to myself.

@przemekwitek
Contributor

przemekwitek commented Apr 15, 2020

From the server logs it appears that changing the value of upgrade mode happens "concurrently" with the creation of the .ml-stats index:

1854 [2020-04-15T07:16:26,442][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Starting to set [upgrade_mode] to [true] from [false]
1855 [2020-04-15T07:16:26,457][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-2] applying create index request using v1 templates [{".ml-state":{"order":0,"version":8000099,"index_patterns":[".ml-state*"],"settings":{"index":{"hidden":"true","auto_expand_replicas":"0-1"}},"mappings":{"_doc":{"_meta":{"version":"8000099"},"enabled":false}},"aliases":{}}}]
1856 [2020-04-15T07:16:26,468][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-2] [.ml-state-000001] creating index, cause [api], templates [.ml-state], shards [1]/[1], mappings [_doc]
1857 [2020-04-15T07:16:26,509][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Executing cluster state update
1858 [2020-04-15T07:16:26,541][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Cluster update response built: true
1859 [2020-04-15T07:16:26,542][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Enabling upgrade mode, must isolate datafeeds
1860 [2020-04-15T07:16:26,542][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Isolating datafeeds: []
1861 [2020-04-15T07:16:26,543][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Isolated the datafeeds
1862 [2020-04-15T07:16:26,544][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Un-assigning persistent tasks : [ data_frame_analytics-classification_set_upgrade_mode ]
1863 [2020-04-15T07:16:26,642][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-2] applying create index request using v1 templates [{".ml-stats":{"order":0,"version":8000099,"index_patterns":[".ml-stats-*"],"settings":{"index":{"hidden":"true","number_of_shards":"1","auto_expand_replicas":"0-1"}},"mappings":{"_doc":{"_meta":{"version":"8.0.0"},"dynamic":false,"properties":{"iteration":{"type":"integer"},"hyperparameters":{"properties":{"class_assignment_objective":{"type":"keyword"},"downsample_factor":{"type":"double"},"eta":{"type":"double"},"eta_growth_rate_per_tree":{"type":"double"},"feature_bag_fraction":{"type":"double"},"max_attempts_to_add_tree":{"type":"integer"},"max_optimization_rounds_per_hyperparameter":{"type":"integer"},"max_trees":{"type":"integer"},"num_folds":{"type":"integer"},"num_splits_per_feature":{"type":"integer"},"regularization_depth_penalty_multiplier":{"type":"double"},"regularization_leaf_weight_penalty_multiplier":{"type":"double"},"regularization_soft_tree_depth_limit":{"type":"double"},"regularization_soft_tree_depth_tolerance":{"type":"double"},"regularization_tree_size_penalty_multiplier":{"type":"double"}}},"job_id":{"type":"keyword"},"parameters":{"properties":{"compute_feature_influence":{"type":"boolean"},"feature_influence_threshold":{"type":"double"},"method":{"type":"keyword"},"n_neighbors":{"type":"integer"},"outlier_fraction":{"type":"double"},"standardization_enabled":{"type":"boolean"}}},"peak_usage_bytes":{"type":"long"},"model_id":{"type":"keyword"},"node_id":{"type":"keyword"},"inference_count":{"type":"long"},"failure_count":{"type":"long"},"missing_all_fields_count":{"type":"long"},"skipped_docs_count":{"type":"long"},"timestamp":{"type":"date"},"timing_stats":{"properties":{"elapsed_time":{"type":"long"},"iteration_time":{"type":"long"}}},"test_docs_count":{"type":"long"},"training_docs_count":{"type":"long"},"type":{"type":"keyword"},"validation_loss":{"properties":{"fold_values":{"properties":{"fold":{"type":"integer"},"values":{"type":"double"}}},"loss_type":{"type":"keyword"}}}}}},"aliases":{}}}]
1864 [2020-04-15T07:16:26,651][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-2] [.ml-stats-000001] creating index, cause [api], templates [.ml-stats], shards [1]/[1], mappings [_doc]
1865 [2020-04-15T07:16:26,681][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Waited for tasks to be unassigned
1866 [2020-04-15T07:16:26,681][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Completed upgrade mode request
1867 [2020-04-15T07:16:26,701][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Starting to set [upgrade_mode] to [false] from [true]
1868 [2020-04-15T07:16:26,702][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Executing cluster state update
1869 [2020-04-15T07:16:26,730][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Cluster update response built: true
1870 [2020-04-15T07:16:26,730][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
1871 [2020-04-15T07:16:26,811][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Done waiting for tasks to be out of AWAITING_UPGRADE
1872 [2020-04-15T07:16:26,812][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-2] Completed upgrade mode request

Other than that, there is no indication of an upgrade mode-related failure in the logs. Still, the test fails because it cannot gather task stats.
I have two hypotheses so far:

  1. the failure is related to the fact that we are trying to search the index (.ml-stats-000001) that is still being created.
  2. the failure is related to the fact that we are trying to gather stats for the task that is just being unassigned.

@mark-vieira
Contributor

This test has failed many times over the past 24 hours. As a point of process, if we see multiple failures in a single day we should automatically mute.

On that note, I've muted this test in master (ecc8599) and 7.x (ba0a39c).

@davidkyle
Member Author

Thanks for muting, @mark-vieira. We were collecting extra debug information and were about to mute it ourselves.

@mark-vieira
Contributor

Thanks for muting, @mark-vieira. We were collecting extra debug information and were about to mute it ourselves.

Is there some kind of annotation that will still execute a test, but not trigger a failure? If not, perhaps it's worth adding such a thing?

@przemekwitek
Contributor

Is there some kind of annotation that will still execute a test, but not trigger a failure? If not, perhaps it's worth adding such a thing?

Sounds like a good idea for all the "add more logging" cases. I'll try to find out if it's there.

@davidkyle
Member Author

Is there some kind of annotation that will still execute a test, but not trigger a failure? If not, perhaps it's worth adding such a thing?

It's a very interesting idea, as leaving tests unmuted with extra debug logging is a common workflow.

I took a look at JUnit TestRules, which I think could be used to catch assertion failures and re-throw them as AssumptionViolatedExceptions. Assumption violations do not fail the test suite, so we would get our debugged test failure without the build failure.

But without the build failure, how would I know that my flaky test had failed again? I would have to check the logs of every build for the assumption exception. There is a conflict between suppressing the failure so the build succeeds and being notified of the failure; I can't see a way to have both, unless we track assumption failures, which would just complicate things, as certain tests assume they are running on certain platforms, etc.
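
For illustration, a minimal sketch of such a rule (hypothetical code, not an existing Elasticsearch utility):

import org.junit.AssumptionViolatedException;
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Hypothetical rule: re-throws assertion failures as assumption violations,
// so the test still runs (and logs) but is reported as skipped rather than failed.
public class SuppressFailureRule implements TestRule {
    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                try {
                    base.evaluate();
                } catch (AssertionError e) {
                    throw new AssumptionViolatedException(
                        "suppressed failure in " + description.getDisplayName(), e);
                }
            }
        };
    }
}

A flaky suite would then attach it with @Rule public SuppressFailureRule suppressFailures = new SuppressFailureRule();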

@mark-vieira
Contributor

But without the build failure, how would I know that my flaky test had failed again?

That's a good question. This is a good use case for retrying tests. In this scenario we still get the failures, and we identify the test as flaky if it passes on subsequent runs. There's first-class support for this in Gradle, and it's on my todo list to investigate how we might integrate it into the build.

Right now we aggressively mute tests to reduce noise, but this has the side effect of throwing away failure data that is potentially useful.
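
For reference, the first-class Gradle support is the test-retry plugin; a build.gradle sketch of wiring it in (the plugin version and thresholds here are illustrative, not what we would necessarily use):

plugins {
    id 'org.gradle.test-retry' version '1.1.6'
}

test {
    retry {
        maxRetries = 2                  // re-run a failed test up to two more times
        maxFailures = 10                // stop retrying once too many tests have failed
        failOnPassedAfterRetry = false  // pass-on-retry marks the test flaky without failing the build
    }
}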

@droberts195
Contributor

I also muted this in 7.x in 5de6ddf.

A recent failure is https://gradle-enterprise.elastic.co/s/qffj2qikw45by so hopefully that contains the extra debug that was added.

@benwtrent
Member

A possibly similar failure: https://gradle-enterprise.elastic.co/s/c7hhi3j24h7o6. I am going to mute ClassificationIT.testSetUpgradeMode_NewTaskDoesNotStart as well.

@przemekwitek
Contributor

A recent failure is https://gradle-enterprise.elastic.co/s/qffj2qikw45by so hopefully that contains the extra debug that was added.

Investigating...

@przemekwitek
Contributor

przemekwitek commented Apr 17, 2020

A possibly similar failure: https://gradle-enterprise.elastic.co/s/c7hhi3j24h7o6. I am going to mute ClassificationIT.testSetUpgradeMode_NewTaskDoesNotStart as well.

The failure in the logs is:

java.lang.AssertionError: 
Expected: is an empty collection
     but: <[{"id":"data_frame_analytics-test_outlier_detection_stop_and_restart","task":{"xpack/ml/data_frame/analytics":{"params":{"id":"test_outlier_detection_stop_and_restart","version":"7.8.0","progress_on_start":[{"phase":"reindexing","progress_percent":0},{"phase":"loading_data","progress_percent":0},{"phase":"analyzing","progress_percent":0},{"phase":"writing_results","progress_percent":0}],"allow_lazy_start":false},"state":{"state":"failed","allocation_id":13,"reason":"Partial shards failure"}}},"allocation_id":160,"assignment":{"executor_node":null,"explanation":"persistent task cannot be assigned while upgrade mode is enabled."},"allocation_id_on_last_status_update":13}]>

This particular test failure is unrelated to this issue. The failing test (testSetUpgradeMode_NewTaskDoesNotStart) expects no DFA tasks to exist at some point, but a task created by a different test (data_frame_analytics-test_outlier_detection_stop_and_restart) still exists.

@przemekwitek
Contributor

przemekwitek commented Apr 17, 2020

A recent failure is https://gradle-enterprise.elastic.co/s/qffj2qikw45by so hopefully that contains the extra debug that was added.

This one is relevant, though. I've found the additional debug message in the logs; it states:

21097 [2020-04-15T23:50:35,391][ERROR][o.e.x.m.a.TransportGetDataFrameAnalyticsStatsAction] [integTest-0] [classification_set_upgrade_mode] Item failure encountered during multi search: all shards failed
21098 org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
21099     at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21100     at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21101     at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21102     at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21103     at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:68) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21104     at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:245) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21105     at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21106     at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21107     at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:402) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21108     at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:639) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21109     at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1171) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21110     at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1280) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21111     at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1254) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21112     at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21113     at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:56) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21114     at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:413) ~[elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21115     at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21116     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) [elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21117     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.8.0-SNAPSHOT.jar:7.8.0-SNAPSHOT]
21118     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_241]
21119     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_241]
21120     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_241]

It's not very enlightening, but after looking at the code where the .ml-state* and .ml-stats* indices are created, it seems to me that the following interleaving can happen:

  1. StartDataFrameAnalyticsAction is executed by the test
  2. A call to DataFrameAnalyticsManager.execute is made. This call makes sure .ml-state* and .ml-stats* indices exist, effectively creating them (but without waiting for them to have assigned shards)
  3. GetDataFrameAnalyticsStatsAction is executed by the test. The action fails because it tries to search .ml-state* and .ml-stats* indices that have no assigned shards.
  4. The whole test fails.

I'm still figuring out why it's the upgrade mode tests that suffer from this, as there are other tests calling the GetDataFrameAnalyticsStatsAction.
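
To make the suspected interleaving concrete, here is a toy sketch (hypothetical client calls, not the actual DataFrameAnalyticsManager code path; client stands for an org.elasticsearch.client.Client, imports elided):

// Step 2: the index is created without waiting for its shards to be assigned.
client.admin().indices().prepareCreate(".ml-stats-000001")
    .setWaitForActiveShards(ActiveShardCount.NONE)
    .get();

// Step 3: a stats call issued immediately afterwards searches the index
// while its shards are still unassigned, so the search fails:
client.prepareSearch(".ml-stats-*").get();
// -> SearchPhaseExecutionException: all shards failed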

@przemekwitek
Contributor

przemekwitek commented Apr 17, 2020

I'm still figuring out why it's the upgrade mode tests that suffer from this, as there are other tests calling the GetDataFrameAnalyticsStatsAction.

Ok, I think I see the difference. All the other tests that use GetDataFrameAnalyticsStatsAction in assertions call it:

  1. either after waitUntilAnalyticsIsStopped(jobId); so the analytics job is then in a stable state (and all the required indices have already been created)
  2. or inside an assertBusy wrapper, which retries until there are no failures.

So in order to unblock the upgrade mode test, I can adopt approach (2), i.e. wrap the GetDataFrameAnalyticsStatsAction call in assertBusy, as sketched below.
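
A minimal sketch of that wrapping, assuming an ESIntegTestCase-style test (which provides assertBusy and client()) and a hypothetical jobId:

// assertBusy retries the block with backoff (for up to 10 seconds by default)
// until it stops throwing, e.g. until the transient "all shards failed" goes away.
assertBusy(() -> {
    GetDataFrameAnalyticsStatsAction.Request request =
        new GetDataFrameAnalyticsStatsAction.Request(jobId);
    client().execute(GetDataFrameAnalyticsStatsAction.INSTANCE, request).actionGet();
});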

The question still remains whether it is ok for GetDataFrameAnalyticsStatsAction to fail only because the .ml-state* or .ml-stats* index is currently being created. Should the action wait for these indices to be yellow? This would cause some stats calls to take much longer than the majority.
@benwtrent , @droberts195 : WDYT?

@droberts195
Contributor

wrap the GetDataFrameAnalyticsStatsAction call in assertBusy.

Since this is what all the other tests do, it's fair enough to fix the immediate problem in the same way.

However, this isn't very user-friendly, as it means a programmatic user of our APIs would have to do the same thing: loop until success if they wanted to get stats immediately after starting a job.

Some of our endpoints provide guarantees about what will work immediately afterwards if you are writing a program that calls a sequence of our endpoints one after the other. For example, with anomaly detection jobs if you get results immediately after calling close or flush then you are guaranteed to get the results from all data submitted before the close or flush.

I wonder if our start data frame analytics job and open anomaly detection job endpoints should provide the guarantee that you can ask for stats immediately after the start/open call returns and:

  1. Not get an error
  2. See the job you just started/opened in the stats

I think the problem of getting an error is probably more likely for data frame analytics jobs than anomaly detection jobs because data frame analytics jobs create two indices on first start (stats and state) whereas anomaly detection jobs only create one (state). But theoretically the problem could affect both.

So I think in the long term we should wait for yellow status on the indices we create as part of start/open before we set the state to started/opened.
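
A sketch of what that wait could look like (a hypothetical helper, not the actual change; the index patterns and timeout are illustrative):

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Hypothetical helper: block until the internal ML indices reach at least
// yellow health (all primaries assigned) before reporting started/opened.
static void waitForMlIndicesYellow(Client client) {
    ClusterHealthResponse health = client.admin().cluster()
        .prepareHealth(".ml-state*", ".ml-stats-*")
        .setWaitForYellowStatus()
        .setTimeout(TimeValue.timeValueSeconds(30))
        .get();
    if (health.isTimedOut()) {
        throw new IllegalStateException("timed out waiting for ML indices to reach yellow status");
    }
}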

@przemekwitek
Contributor

Since this is what all the other tests do, it's fair enough to fix the immediate problem in the same way.

Ok, I've just sent a PR for review that does just that.

However, this isn't very user-friendly, as it means a programmatic user of our APIs would have to do the same thing: loop until success if they wanted to get stats immediately after starting a job.

Exactly. That's why I raised this...

I think the problem of getting an error is probably more likely for data frame analytics jobs than anomaly detection jobs because data frame analytics jobs create two indices on first start (stats and state)

Yes, and both indices are needed in the stats call.

So I think in the long term we should wait for yellow status on the indices we create as part of start/open before we set the state to started/opened.

I'm ok with that as long as the user is not surprised that their first stats call will take (much) longer than the other ones.

@droberts195
Contributor

I'm ok with that as long as the user is not surprised that their first stats call will take (much) longer than the other ones.

I think we should wait in the start/open call, not the stats call.

The start/open call might just have created the indices, so there's a legitimate reason to think that if they're not yet yellow, they will be soon. But for the stats call, if we waited every time, we could be waiting on fatal problems that require administrator intervention to resolve.

@przemekwitek
Contributor

I think we should wait in the start/open call, not the stats call.

Ah, yeah. Of course that would make much more sense.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 3, 2020
There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes elastic#54887
Fixes elastic#55221
Fixes elastic#55807
Fixes elastic#57102
Fixes elastic#58841
Fixes elastic#59011
droberts195 added a commit that referenced this issue Jul 6, 2020
…#59027)

There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes #54887
Fixes #55221
Fixes #55807
Fixes #57102
Fixes #58841
Fixes #59011