ML: update set_upgrade_mode, add logging #38372
Conversation
Pinging @elastic/ml-core
run elasticsearch-ci/1
LGTM
On further analysis I can see that the problem is not fixed
I've had a closer look at what's happening here, and have spotted a bug that is the most likely cause. It's actually a bug that could affect us in other situations, so it's good that implementing this endpoint has exposed it. Some history: when the concept of isolating datafeeds was first added, we always isolated all datafeeds running on a particular node. Later the ability to just isolate a single datafeed was added, and this is when the bug was introduced.
The test failure is coming from this piece of code:
The intention is not to react to a lookback finishing if the datafeed is isolated, but we are reacting because we're checking the variable that's only set when all datafeeds are isolated by a single function call. Changing it to this should fix the problem:
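The exact replacement snippet from the review comment is not preserved in this transcript. As a rough, hedged sketch of the idea only (the class and member names below are invented for illustration and are not the real Elasticsearch code), the check moves from a holder-wide "all datafeeds isolated" flag to the per-datafeed isolation state:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: a per-datafeed isolation check versus a flag that is
// set solely by the "isolate every datafeed on this node" code path.
class LookbackIsolationSketch {
    private final Set<String> isolatedDatafeeds = new HashSet<>();
    private boolean allDatafeedsIsolated; // only set when everything is isolated in one call

    void isolateAll() {
        allDatafeedsIsolated = true;
    }

    void isolate(String datafeedId) {
        isolatedDatafeeds.add(datafeedId); // single-datafeed isolation never sets the flag above
    }

    void onLookbackFinished(String datafeedId) {
        // Buggy form: if (allDatafeedsIsolated == false) { reactToLookbackFinished(datafeedId); }
        // Fixed form: consult the state of this particular datafeed as well.
        if (allDatafeedsIsolated == false && isolatedDatafeeds.contains(datafeedId) == false) {
            reactToLookbackFinished(datafeedId);
        }
    }

    private void reactToLookbackFinished(String datafeedId) {
        System.out.println("reacting to lookback finishing for datafeed [" + datafeedId + "]");
    }
}
```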
Also renamed a few methods/variables for clarity and added some comments
    finishHandler.accept(e);
    logger.info("[{}] datafeed [{}] for job [{}] has been stopped{}", source, datafeedId, datafeedJob.getJobId(),
        acquired ? "" : ", but there may be pending tasks as the timeout [" + timeout.getStringRep() + "] expired");
-   if (autoCloseJob) {
+   if (autoCloseJob && isIsolated() == false) {
This highlights the massive complexity caused by separating jobs and datafeeds and then trying to make them work together in a user-friendly way.
There are 3 scenarios where we isolate datafeeds:
- Node shutting down - this is handled by the if check at the very beginning of the method.
- Upgrade mode - in this case we most definitely do not want to close the associated job.
- Force-delete datafeed - this usually happens as part of the UI deleting a job and datafeed, so closing the job is unnecessary because the force-delete of the job, which comes soon after the force-delete of the datafeed, will do that anyway. Theoretically a user could force-delete just the datafeed and not the job, and in that case it would make sense to auto-close the job. However, leaving the job open is the path that is least likely to do damage, so we'll go with that.
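A hedged sketch of how these three scenarios map onto the new autoCloseJob && isIsolated() == false condition (the class shape and names here are assumptions made purely for illustration, not the actual DatafeedManager code):

```java
// Illustrative model of the decision described above: the job is only auto-closed
// when the datafeed genuinely finished, not when it was merely isolated.
final class AutoCloseDecisionSketch {
    private final boolean autoCloseJob;      // caller asked for the job to be closed when the datafeed stops
    private final boolean isolated;          // set for upgrade mode and force-delete (scenarios 2 and 3)
    private final boolean nodeShuttingDown;  // scenario 1, handled by an early return in the real method

    AutoCloseDecisionSketch(boolean autoCloseJob, boolean isolated, boolean nodeShuttingDown) {
        this.autoCloseJob = autoCloseJob;
        this.isolated = isolated;
        this.nodeShuttingDown = nodeShuttingDown;
    }

    boolean shouldCloseJobAfterDatafeedStops() {
        if (nodeShuttingDown) {
            // Scenario 1: in the real code the method returns before reaching the check below.
            return false;
        }
        // Scenarios 2 and 3 mark the datafeed as isolated, so the job is left open there.
        return autoCloseJob && isolated == false;
    }
}
```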
All this highlights the total lack of encapsulation between jobs and datafeeds. The fact that you need an intimate knowledge of exactly how every usage pattern plays out to make a small change to datafeeds shows that it would make things much easier if they were one and the same thing, as suggested by #34231. But since implementing #34231 would be a multi-month project maybe the best thing is just to get things working with the current separation and then try to change as little as possible in this area.
@@ -324,7 +325,7 @@ boolean isIsolated() {
    }

    public void stop(String source, TimeValue timeout, Exception e) {
-       if (isRelocating) {
+       if (isNodeShuttingDown) {
I renamed this variable to make it clearer what it's used for. In the case of isolating a datafeed prior to enabling upgrade mode, the datafeed may relocate to a different node when upgrade mode is eventually cancelled. But as long as the node where this code is running remains up, it needs to run the finishHandler.accept(e); line below, otherwise the local task associated with the persistent task will linger and cause problems.
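A minimal, self-contained sketch of the behaviour described here (the class, field, and handler wiring are assumptions for illustration only, not the real datafeed task code):

```java
import java.util.function.Consumer;

// Illustrative sketch: distinguish "the whole node is shutting down" from
// "this datafeed has merely been isolated and may later relocate".
class StopBehaviourSketch {
    private volatile boolean isNodeShuttingDown; // renamed from isRelocating for clarity
    private final Consumer<Exception> finishHandler;

    StopBehaviourSketch(Consumer<Exception> finishHandler) {
        this.finishHandler = finishHandler;
    }

    void markNodeShuttingDown() {
        isNodeShuttingDown = true;
    }

    void stop(Exception e) {
        if (isNodeShuttingDown) {
            // The node is going away; the persistent task will be reassigned elsewhere,
            // so there is nothing to complete locally.
            return;
        }
        // Even if the datafeed was only isolated (e.g. ahead of upgrade mode) and may
        // relocate later, this node must still complete the local task, otherwise it
        // lingers alongside the persistent task and causes problems.
        finishHandler.accept(e);
    }
}
```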
LGTM
After merging to master this will now need backporting to 7.x, 7.0 and 6.7.
* ML: update set_upgrade_mode, add logging
* Attempt to fix datafeed isolation

Also renamed a few methods/variables for clarity and added some comments
There have been some flaky tests in the set_upgrade_mode yml tests. This PR adds some logging (to aid future debugging) and hopefully addresses the flakiness of the tests.

I have not been able to reproduce this scenario: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java8,nodes=immutable&&linux&&docker/221/console

From what I can tell in the logs, the datafeed was stopped. The no_realtime indicates that for some reason the datafeed was not being treated as a real-time job. To combat this, I am adding an index health check + logs to aid in troubleshooting in the future.
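A hedged sketch of the kind of "index health check + logging" described above (the index pattern, timeout, and surrounding wiring are assumptions, not the exact change made in this PR):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Illustrative only: wait for the ML indices to be at least yellow before carrying on,
// and log the outcome so that future test failures are easier to diagnose.
final class MlIndexHealthCheckSketch {
    private static final Logger logger = LogManager.getLogger(MlIndexHealthCheckSketch.class);

    static void waitForMlIndices(Client client) {
        boolean timedOut = client.admin().cluster()
            .prepareHealth(".ml-*")
            .setWaitForYellowStatus()
            .setTimeout(TimeValue.timeValueSeconds(30))
            .get()
            .isTimedOut();
        logger.info("ML index health check finished, timedOut [{}]", timedOut);
    }
}
```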