ML: update set_upgrade_mode, add logging #38372
Conversation
Pinging @elastic/ml-core
run elasticsearch-ci/1
LGTM
On further analysis I can see that the problem is not fixed
I've had a closer look at what's happening here, and have spotted a bug that is the most likely cause. It's actually a bug that could affect us in other situations, so it's good that implementing this endpoint has exposed it. Some history: when the concept of isolating datafeeds was first added, we always isolated all datafeeds running on a particular node. Later the ability to just isolate a single datafeed was added, and this is when the bug was introduced.
The test failure is coming from this piece of code:
The intention is not to react to a lookback finishing if the datafeed is isolated, but we are reacting because we're checking the variable that's only set when all datafeeds are isolated by a single function call. Changing it to this should fix the problem:
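The exact replacement snippet from the review comment is not preserved in this transcript. As a rough, hedged sketch of the idea only (the class and member names below are invented for illustration and are not the real Elasticsearch code), the check moves from a holder-wide "all datafeeds isolated" flag to the per-datafeed isolation state:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: a per-datafeed isolation check versus a flag that is
// set solely by the "isolate every datafeed on this node" code path.
class LookbackIsolationSketch {
    private final Set<String> isolatedDatafeeds = new HashSet<>();
    private boolean allDatafeedsIsolated; // only set when everything is isolated in one call

    void isolateAll() {
        allDatafeedsIsolated = true;
    }

    void isolate(String datafeedId) {
        isolatedDatafeeds.add(datafeedId); // single-datafeed isolation never sets the flag above
    }

    void onLookbackFinished(String datafeedId) {
        // Buggy form: if (allDatafeedsIsolated == false) { reactToLookbackFinished(datafeedId); }
        // Fixed form: consult the state of this particular datafeed as well.
        if (allDatafeedsIsolated == false && isolatedDatafeeds.contains(datafeedId) == false) {
            reactToLookbackFinished(datafeedId);
        }
    }

    private void reactToLookbackFinished(String datafeedId) {
        System.out.println("reacting to lookback finishing for datafeed [" + datafeedId + "]");
    }
}
```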
Also renamed a few methods/variables for clarity and added some comments
    finishHandler.accept(e);
    logger.info("[{}] datafeed [{}] for job [{}] has been stopped{}", source, datafeedId, datafeedJob.getJobId(),
        acquired ? "" : ", but there may be pending tasks as the timeout [" + timeout.getStringRep() + "] expired");
-   if (autoCloseJob) {
+   if (autoCloseJob && isIsolated() == false) {
This highlights the massive complexity caused by separating jobs and datafeeds and then trying to make them work together in a user-friendly way.
There are 3 scenarios where we isolate datafeeds:
- Node shutting down - this is handled by the if check at the very beginning of the method.
- Upgrade mode - in this case we most definitely do not want to close the associated job.
- Force-delete datafeed - this usually happens as part of the UI deleting a job and datafeed, so closing the job is unnecessary because the force-delete of the job, which comes soon after the force-delete of the datafeed, will do that anyway. Theoretically a user could force-delete just the datafeed and not the job, and in that case it would make sense to auto-close the job. However, leaving the job open is the path that is least likely to do damage, so we'll go with that.
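A hedged sketch of how these three scenarios map onto the new autoCloseJob && isIsolated() == false condition (the class shape and names here are assumptions made purely for illustration, not the actual DatafeedManager code):

```java
// Illustrative model of the decision described above: the job is only auto-closed
// when the datafeed genuinely finished, not when it was merely isolated.
final class AutoCloseDecisionSketch {
    private final boolean autoCloseJob;      // caller asked for the job to be closed when the datafeed stops
    private final boolean isolated;          // set for upgrade mode and force-delete (scenarios 2 and 3)
    private final boolean nodeShuttingDown;  // scenario 1, handled by an early return in the real method

    AutoCloseDecisionSketch(boolean autoCloseJob, boolean isolated, boolean nodeShuttingDown) {
        this.autoCloseJob = autoCloseJob;
        this.isolated = isolated;
        this.nodeShuttingDown = nodeShuttingDown;
    }

    boolean shouldCloseJobAfterDatafeedStops() {
        if (nodeShuttingDown) {
            // Scenario 1: in the real code the method returns before reaching the check below.
            return false;
        }
        // Scenarios 2 and 3 mark the datafeed as isolated, so the job is left open there.
        return autoCloseJob && isolated == false;
    }
}
```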
All this highlights the total lack of encapsulation between jobs and datafeeds. The fact that you need an intimate knowledge of exactly how every usage pattern plays out to make a small change to datafeeds shows that it would make things much easier if they were one and the same thing, as suggested by #34231. But since implementing #34231 would be a multi-month project maybe the best thing is just to get things working with the current separation and then try to change as little as possible in this area.
@@ -324,7 +325,7 @@ boolean isIsolated() {
    }

    public void stop(String source, TimeValue timeout, Exception e) {
-       if (isRelocating) {
+       if (isNodeShuttingDown) {
I renamed this variable to make it clearer what it's used for. In the case of isolating a datafeed prior to enabling upgrade mode, the datafeed may relocate to a different node when upgrade mode is eventually cancelled. But as long as the node where this code is running remains up, it needs to run the finishHandler.accept(e); line below, otherwise the local task associated with the persistent task will linger and cause problems.
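A minimal, self-contained sketch of the behaviour described here (the class, field, and handler wiring are assumptions for illustration only, not the real datafeed task code):

```java
import java.util.function.Consumer;

// Illustrative sketch: distinguish "the whole node is shutting down" from
// "this datafeed has merely been isolated and may later relocate".
class StopBehaviourSketch {
    private volatile boolean isNodeShuttingDown; // renamed from isRelocating for clarity
    private final Consumer<Exception> finishHandler;

    StopBehaviourSketch(Consumer<Exception> finishHandler) {
        this.finishHandler = finishHandler;
    }

    void markNodeShuttingDown() {
        isNodeShuttingDown = true;
    }

    void stop(Exception e) {
        if (isNodeShuttingDown) {
            // The node is going away; the persistent task will be reassigned elsewhere,
            // so there is nothing to complete locally.
            return;
        }
        // Even if the datafeed was only isolated (e.g. ahead of upgrade mode) and may
        // relocate later, this node must still complete the local task, otherwise it
        // lingers alongside the persistent task and causes problems.
        finishHandler.accept(e);
    }
}
```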
LGTM
After merging to master this will now need backporting to 7.x, 7.0 and 6.7.
* ML: update set_upgrade_mode, add logging
* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added some comments
There have been some flaky tests in the set_upgrade_mode yml tests. This PR adds some logging (to aid future debugging) and hopefully addresses the flakiness of the tests.

I have not been able to reproduce this scenario: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=immutable&&linux&&docker/221/console

From what I can tell in the logs, the datafeed was stopped. The no_realtime indicates that for some reason the datafeed was not being treated as a real-time job. To combat this, I am adding an index health check + logs to aid in troubleshooting in the future.
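A hedged sketch of the kind of "index health check + logging" described above (the index pattern, timeout, and surrounding wiring are assumptions, not the exact change made in this PR):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Illustrative only: wait for the ML indices to be at least yellow before carrying on,
// and log the outcome so that future test failures are easier to diagnose.
final class MlIndexHealthCheckSketch {
    private static final Logger logger = LogManager.getLogger(MlIndexHealthCheckSketch.class);

    static void waitForMlIndices(Client client) {
        boolean timedOut = client.admin().cluster()
            .prepareHealth(".ml-*")
            .setWaitForYellowStatus()
            .setTimeout(TimeValue.timeValueSeconds(30))
            .get()
            .isTimedOut();
        logger.info("ML index health check finished, timedOut [{}]", timedOut);
    }
}
```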