
ML: update set_upgrade_mode, add logging #38372

Conversation

benwtrent
Member

Some of the set_upgrade_mode yml tests have been flaky. This PR adds some logging (to aid future debugging) and hopefully addresses the flakiness.

I have not been able to reproduce this scenario: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=immutable&&linux&&docker/221/console

From what I can tell in the logs, the datafeed was stopped:

./x-pack/plugin/build/cluster/integTestCluster node0/elasticsearch-6.7.0-SNAPSHOT/logs/x-pack_plugin_integTestCluster.log:[2019-02-03T23:49:03,257][INFO ][o.e.x.m.d.DatafeedManager] [node-0] [no_realtime] try lock [20s] to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]...

The no_realtime stop reason indicates that, for some reason, the datafeed was not being treated as a realtime job. To combat this, I am adding an index health check plus extra logging to aid in troubleshooting in the future.
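For context, here is a minimal sketch (an assumption, not the code in this PR) of the kind of pre-flight index health check described above, written against the 6.x transport Client API: wait for the relevant index to reach at least yellow health before relying on it, so the outcome can be logged when a test run misbehaves. The helper name, index parameter, and timeout are illustrative.

    import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.cluster.health.ClusterHealthStatus;
    import org.elasticsearch.common.unit.TimeValue;

    class IndexHealthCheckSketch {

        /**
         * Waits for the given index to reach at least yellow health and reports
         * whether it is usable, so the caller can log the result before relying
         * on the index. Hypothetical helper, not part of this PR.
         */
        static boolean indexIsUsable(Client client, String index) {
            ClusterHealthResponse health = client.admin().cluster()
                .prepareHealth(index)
                .setWaitForYellowStatus()
                .setTimeout(TimeValue.timeValueSeconds(30)) // illustrative timeout
                .get();
            return health.isTimedOut() == false && health.getStatus() != ClusterHealthStatus.RED;
        }
    }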

@benwtrent benwtrent added the >test (Issues or PRs that are addressing/adding tests), v7.0.0, :ml (Machine learning), and v6.7.0 labels Feb 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@benwtrent
Member Author

run elasticsearch-ci/1

droberts195 previously approved these changes Feb 5, 2019
Contributor

LGTM

@droberts195 droberts195 dismissed their stale review February 5, 2019 10:41

On further analysis I can see that the problem is not fixed.

@droberts195
Contributor

droberts195 commented Feb 5, 2019

I've had a closer look at what's happening here, and have spotted a bug that is the most likely cause. It's actually a bug that could affect us in other situations, so it's good that implementing this endpoint has exposed it.

Some history: when the concept of isolating datafeeds was first added, we always isolated all datafeeds running on a particular node. Later the ability to just isolate a single datafeed was added, and this is when the bug was introduced.

DatafeedManager has a member variable, isolated, that is set to true when all datafeeds are isolated simultaneously. This dates back to the time when the only way to isolate datafeeds was to isolate all the datafeeds on a node. Now that it is possible to isolate a single datafeed we should not have this variable at all. Please delete it from the code. Every check of isolated in DatafeedManager should instead test the value of holder.isIsolated(), which is set on a per-datafeed basis when that datafeed is isolated.
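For illustration, a minimal sketch of the per-datafeed flag being described, assuming a simplified stand-in for DatafeedManager and its Holder rather than the real classes:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified sketch: isolation is tracked per datafeed on its holder,
    // not with a single manager-wide `isolated` flag.
    class DatafeedManagerSketch {

        static class Holder {
            private volatile boolean isolated;

            void isolateDatafeed() {
                isolated = true;
            }

            boolean isIsolated() {
                return isolated;
            }
        }

        private final Map<Long, Holder> runningDatafeeds = new ConcurrentHashMap<>();

        // Isolating one datafeed flips only that holder's flag...
        void isolateDatafeed(long allocationId) {
            Holder holder = runningDatafeeds.get(allocationId);
            if (holder != null) {
                holder.isolateDatafeed();
            }
        }

        // ...so decisions such as "should this lookback roll into real time?"
        // must consult holder.isIsolated(), never a manager-wide boolean that
        // is only set when every datafeed on the node is isolated at once.
        boolean shouldContinueInRealTime(Holder holder) {
            return holder.isIsolated() == false;
        }
    }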

The test failure is coming from this piece of code in DatafeedManager:

                if (isolated == false) {
                    if (next != null) {
                        doDatafeedRealtime(next, holder.datafeedJob.getJobId(), holder);
                    } else {
                        holder.stop("no_realtime", TimeValue.timeValueSeconds(20), null);
                        holder.problemTracker.finishReport();
                    }
                }

The intention is not to react to a lookback finishing if the datafeed is isolated, but we are reacting because we're checking the variable that's only set when all datafeeds are isolated by a single function call. Changing it to this should fix the problem:

                if (holder.isIsolated() == false) {
                    if (next != null) {
                        doDatafeedRealtime(next, holder.datafeedJob.getJobId(), holder);
                    } else {
                        holder.stop("no_realtime", TimeValue.timeValueSeconds(20), null);
                        holder.problemTracker.finishReport();
                    }
                }

Also renamed a few methods/variables for clarity and added some comments.

Contributor commented on this diff hunk:

                finishHandler.accept(e);
                logger.info("[{}] datafeed [{}] for job [{}] has been stopped{}", source, datafeedId, datafeedJob.getJobId(),
                        acquired ? "" : ", but there may be pending tasks as the timeout [" + timeout.getStringRep() + "] expired");
    -           if (autoCloseJob) {
    +           if (autoCloseJob && isIsolated() == false) {

This highlights the massive complexity caused by separating jobs and datafeeds while still trying to make them work together in a user-friendly way.

There are 3 scenarios where we isolate datafeeds:

  1. Node shutting down - this is handled by the if at the very beginning of the method.
  2. Upgrade mode - in this case we most definitely do not want to close the associated job.
  3. Force-delete datafeed - this usually happens as part of the UI deleting a job and datafeed, so closing the job is unnecessary as the force-delete of the job that comes soon after the force-delete of the datafeed will do that. But theoretically a user could force-delete just the datafeed and not the job, and in this case it would make sense to auto-close the job. However, leaving the job open is the path that is least likely to do damage, so we'll go with that.

All this highlights the total lack of encapsulation between jobs and datafeeds. The fact that you need an intimate knowledge of exactly how every usage pattern plays out to make a small change to datafeeds shows that it would make things much easier if they were one and the same thing, as suggested by #34231. But since implementing #34231 would be a multi-month project maybe the best thing is just to get things working with the current separation and then try to change as little as possible in this area.
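To make the three scenarios concrete, here is a small sketch of the resulting auto-close decision; apart from the isIsolated() idea discussed in this thread, the class, method, and parameter names are illustrative rather than taken from the real code.

    // Illustrative only: how the three isolation scenarios map onto the
    // `autoCloseJob && isIsolated() == false` check referred to above.
    class AutoCloseDecisionSketch {

        /**
         * @param nodeShuttingDown scenario 1: the whole node is going down
         * @param isolated         scenarios 2 and 3: upgrade mode or a force-delete
         * @param autoCloseJob     the caller asked for the job to be closed when the datafeed stops
         */
        static boolean shouldAutoCloseJob(boolean nodeShuttingDown, boolean isolated, boolean autoCloseJob) {
            if (nodeShuttingDown) {
                // Scenario 1 is handled before any auto-close logic is reached.
                return false;
            }
            // Scenarios 2 and 3: an isolated datafeed never auto-closes its job.
            return autoCloseJob && isolated == false;
        }
    }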

@@ -324,7 +325,7 @@ boolean isIsolated() {
         }

         public void stop(String source, TimeValue timeout, Exception e) {
    -        if (isRelocating) {
    +        if (isNodeShuttingDown) {
Contributor

I renamed this variable to make clearer what it's used for. In the case of isolating a datafeed prior to enabling upgrade mode, the datafeed may relocate to a different node when upgrade mode is eventually cancelled. But as long as the node where this code is running remains up, it needs to run the finishHandler.accept(e); line below, otherwise the local task associated with the persistent task will linger and cause problems.
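As a rough sketch of the control flow being described (assumed names apart from isNodeShuttingDown and finishHandler, and none of the locking the real stop method does), only a genuine node shutdown skips completing the local task:

    import java.util.function.Consumer;

    // Sketch only: the point is the ordering of the isNodeShuttingDown check
    // and the finishHandler call, not the real implementation.
    class StopFlowSketch {

        private volatile boolean isNodeShuttingDown;
        private final Consumer<Exception> finishHandler;

        StopFlowSketch(Consumer<Exception> finishHandler) {
            this.finishHandler = finishHandler;
        }

        void stop(String source, Exception e) {
            if (isNodeShuttingDown) {
                // The persistent task will be handled elsewhere; nothing to complete locally.
                return;
            }
            // ...the real code also acquires the datafeed's run lock and cancels timeouts here...

            // Completing the local task is what stops it lingering when the datafeed
            // is merely isolated (e.g. for upgrade mode) rather than the node shutting down.
            finishHandler.accept(e);
        }
    }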

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM

@droberts195
Contributor

After merging to master this will now need backporting to 7.x, 7.0 and 6.7.

@jasontedor jasontedor removed the v7.0.0 label Feb 6, 2019
@benwtrent benwtrent merged commit 3bf8e24 into elastic:master Feb 6, 2019
@benwtrent benwtrent deleted the feature/ml-set-upgrade-mode-add-logging-adj-test branch February 6, 2019 20:39
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 6, 2019
* ML: update set_upgrade_mode, add logging

* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added
some comments
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 6, 2019
* ML: update set_upgrade_mode, add logging

* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added
some comments
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 6, 2019
* master:
  ML: update set_upgrade_mode, add logging (elastic#38372)
  bad formatted JSON object (elastic#38515) (elastic#38525)
  Fix HistoryIntegrationTests timestamp comparsion (elastic#38505)
  SQL: Fix issue with IN not resolving to underlying keyword field (elastic#38440)
  Fix the clock resolution to millis in ScheduledEventTests (elastic#38506)
@colings86 colings86 removed the v7.0.0 label Feb 7, 2019
benwtrent added a commit that referenced this pull request Feb 7, 2019
* ML: update set_upgrade_mode, add logging

* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added
some comments
benwtrent added a commit that referenced this pull request Feb 8, 2019
* ML: update set_upgrade_mode, add logging

* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added
some comments
benwtrent added a commit that referenced this pull request Feb 8, 2019
* ML: update set_upgrade_mode, add logging

* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added
some comments
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 8, 2019
* 6.7:
  Fix build on Fips (elastic#38546)
  ML: update set_upgrade_mode, add logging (elastic#38372) (elastic#38539)
  Update 'removal of types' docs to reflect the new plan. (elastic#38548)
  Make the 'get templates' types deprecation message consistent. (elastic#38532)
  Add API key settings documentation (elastic#38490) (elastic#38499)
  Mute testExpiredApiKeysDeletedAfter1Week in 6.7 (elastic#38528)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 11, 2019
* master: (27 commits)
  Mute AnalysisModuleTests#testStandardFilterBWC (elastic#38636)
  add geotile_grid ref to asciidoc (elastic#38632)
  Enable Dockerfile from artifacts.elastic.co (elastic#38552)
  Mute FollowerFailOverIT testFailOverOnFollower (elastic#38634)
  Account for a possible rolled over file while reading the audit log file (elastic#34909)
  Mute failure in InternalEngineTests (elastic#38622)
  Fix Issue with Concurrent Snapshot Init + Delete (elastic#38518)
  Refactor ZonedDateTime.now in millis resolution (elastic#38577)
  Mute failing WatchStatusIntegrationTests (elastic#38621)
  Mute failing  ApiKeyIntegTests (elastic#38614)
  [DOCS] Add warning about bypassing ML PUT APIs (elastic#38509)
  Add 7.1 and 8.0 version constants to master (elastic#38514)
  ML: update set_upgrade_mode, add logging (elastic#38372)
  bad formatted JSON object (elastic#38515) (elastic#38525)
  Fix HistoryIntegrationTests timestamp comparsion (elastic#38505)
  SQL: Fix issue with IN not resolving to underlying keyword field (elastic#38440)
  Fix the clock resolution to millis in ScheduledEventTests (elastic#38506)
  Enable BWC after backport recovering leases (elastic#38485)
  Collapse retention lease integration tests (elastic#38483)
  TransportVerifyShardBeforeCloseAction should force a flush (elastic#38401)
  ...
Labels
:ml (Machine learning), >test (Issues or PRs that are addressing/adding tests), v6.7.0, v7.0.0-beta1, v7.2.0, v8.0.0-alpha1
7 participants