
WatcherYamlRestIT fails on test get watch api with missing watch on existing index #53177

Closed
DaveCTurner opened this issue Mar 5, 2020 · 14 comments
Labels
:Data Management/Watcher, Team:Data Management, >test-failure

Comments

@DaveCTurner
Contributor

The following two builds failed for similar-looking reasons:

The failure looked like this:


org.elasticsearch.smoketest.WatcherYamlRestIT > test {p0=watcher/put_watch/10_basic/Test put watch api} FAILED
    java.lang.AssertionError:
    Expected: <0>
         but: was <1>
        at __randomizedtesting.SeedInfo.seed([BC95DB412F8CDAF0:34C1E49B8170B708]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.watcher.WatcherYamlSuiteTestCase.lambda$stopWatcher$1(WatcherYamlSuiteTestCase.java:74)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:906)
        at org.elasticsearch.xpack.watcher.WatcherYamlSuiteTestCase.stopWatcher(WatcherYamlSuiteTestCase.java:67)
        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.lang.Thread.run(Thread.java:748)

REPRODUCE WITH: ./gradlew ':x-pack:plugin:watcher:qa:rest:integTestRunner' --tests "org.elasticsearch.smoketest.WatcherYamlRestIT.test {p0=watcher/put_watch/10_basic/Test put watch api}" -Dtests.seed=BC95DB412F8CDAF0 -Dtests.security.manager=true -Dtests.locale=es-SV -Dtests.timezone=GMT0 -Dcompiler.java=13


I haven't attempted to reproduce this.
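
For context, the assertion that fails here is the check in WatcherYamlSuiteTestCase's start/stop helpers that waits for the watcher stats to report no registered watches between tests. Below is a minimal sketch of that kind of check, not the actual test source; currentWatchCount() is a hypothetical stand-in for reading watch_count from the watcher stats API.

// Minimal sketch, assuming the check polls the watcher stats API for watch_count.
import static org.hamcrest.Matchers.equalTo;

public abstract class WatcherTeardownCheckSketch extends org.elasticsearch.test.ESTestCase {

    // Hypothetical helper: read watch_count from the watcher stats API response.
    protected abstract int currentWatchCount() throws Exception;

    protected void assertNoWatchesRegistered() throws Exception {
        // Polls until the stats report zero watches; this is the assertion that
        // intermittently observes 1 instead of 0 in the failures above.
        assertBusy(() -> assertThat(currentWatchCount(), equalTo(0)));
    }
}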

@DaveCTurner added the >test-failure and :Data Management/Watcher labels Mar 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Watcher)

@mark-vieira
Contributor

I tried, but it didn't reproduce locally for me.

martijnvg added a commit that referenced this issue Mar 6, 2020
This check was added as part of: 0f2d26b

Checking this before the test starts makes more sense, because
the watches index has then also been removed.

Relates to #53177
@jaymode
Member

jaymode commented Mar 6, 2020

This also failed on 7.x today https://gradle-enterprise.elastic.co/s/izleuckcsczy6/tests/j3inyc2qyhaby-3e5dsrwmfrm6w.

@martijnvg I see you have a commit that mentions this issue. I wonder if there might be a race condition in stopping the TriggerService, where the TriggerService is stopped but a watch is added concurrently with or after the stop of the TriggerService somehow, maybe by the WatcherIndexingListener?

martijnvg added a commit that referenced this issue Mar 9, 2020
@martijnvg
Member

Thanks @jaymode for triaging this failure. I also did some investigating: if watcher is stopped then the watch_count returned by the watcher stats API is 0, because all watches in the trigger service are paused. However, if watcher is stopped, a watch can still be added via the API, and this increases the watch count in the watcher stats API by 1. This can also happen if, for example, a write is retried on a .watches shard copy that has watches allocated. The expectation is that if watcher is stopped then the watch_count is always zero, but this isn't true. The WatcherIndexingListener doesn't check whether watcher is stopped, just whether it should delegate specific watches to the trigger service. I think that WatcherIndexingListener (and maybe other places too) should also check what the watcher_state is and act on it.

I'm not sure whether this problem occurs here; the watcher rest qa module doesn't execute tests in parallel, and I've not yet found evidence that a watch was added after watcher was stopped.

The commit that I pushed delays the checking of the watcher stats' watch_count that happens between tests. Before, this was done after watcher was stopped; now it happens after watcher is stopped and the .watches index has been removed. I think this makes sense based on how the watcher stop API works, but it hasn't fixed the test failure.

@williamrandolph
Contributor

There's another instance of this failure on 7.x: https://gradle-enterprise.elastic.co/s/xreqekg5chqoy

One failure on WatcherYamlRestIT:

2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:watcher:qa:rest:integTestRunner' --tests "org.elasticsearch.smoketest.WatcherYamlRestIT.test {p0=watcher/put_watch/10_basic/Test empty body is rejected by put watch}" -Dtests.seed=41277B042F438E42 -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=Asia/Colombo -Dcompiler.java=13
2> java.lang.AssertionError:
    Expected: <0>
         but: was <1>
        at __randomizedtesting.SeedInfo.seed([41277B042F438E42:C97344DE81BFE3BA]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.watcher.WatcherYamlSuiteTestCase.lambda$startWatcher$0(WatcherYamlSuiteTestCase.java:57)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:906)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:879)
        at org.elasticsearch.xpack.watcher.WatcherYamlSuiteTestCase.startWatcher(WatcherYamlSuiteTestCase.java:39)
  […]

…and another on SmokeTestWatcherWithSecurityClientYamlTestSuiteIT:

2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:watcher:qa:with-security:integTestRunner' --tests "org.elasticsearch.smoketest.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT.test {yaml=watcher/activate_watch/10_basic/Non existent watch returns 404}" -Dtests.seed=41277B042F438E42 -Dtests.security.manager=true -Dtests.locale=mt-MT -Dtests.timezone=Asia/Karachi -Dcompiler.java=13
2> java.lang.AssertionError:
    Expected: <0>
         but: was <1>
        at __randomizedtesting.SeedInfo.seed([41277B042F438E42:C97344DE81BFE3BA]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.watcher.WatcherYamlSuiteTestCase.lambda$startWatcher$0(WatcherYamlSuiteTestCase.java:57)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:906)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:879)
        at org.elasticsearch.xpack.watcher.WatcherYamlSuiteTestCase.startWatcher(WatcherYamlSuiteTestCase.java:39)

@williamrandolph
Contributor

One more on SmokeTestWatcherWithSecurityClientYamlTestSuiteIT, master branch: https://gradle-enterprise.elastic.co/s/43mjgxd6sesck

jakelandis added a commit to jakelandis/elasticsearch that referenced this issue Mar 12, 2020
PR elastic#53362 muted the entire suite of REST tests, but only 2 tests were
identified as failing. This PR un-mutes the full suite and only mutes
the 2 tests identified in elastic#53177
@jakelandis
Contributor

It appears that this may be a setup/teardown issue, and build-stats shows more test failures than the ones referenced here. As of #53362 all of the REST tests for Watcher are muted; I will look into this issue as soon as possible.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 13, 2020
martijnvg added a commit that referenced this issue Mar 13, 2020
Also log a message instead of failing if there are active watches at the beginning of a test.

Relates to #53177
martijnvg added a commit that referenced this issue Mar 16, 2020
@mayya-sharipova
Contributor

Another failure of SmokeTestWatcherWithSecurityClientYamlTestSuiteIT on 7.x today, which doesn't reproduce for me.

./gradlew ':x-pack:plugin:watcher:qa:with-security:integTestRunner' --tests "org.elasticsearch.smoketest.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT.test {yaml=watcher/usage/10_basic/Test watcher usage stats output}"
-Dtests.seed=21A43F67890E43A9
-Dtests.security.manager=true
-Dtests.locale=sl
-Dtests.timezone=Asia/Srednekolymsk
-Dcompiler.java=13

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part2/4330/console
Build Scans: https://gradle-enterprise.elastic.co/s/dhqipxj2iuyfc

Stack trace:
java.lang.AssertionError: Failure at [watcher/usage/10_basic:48]: field [watcher.count.active] is not greater than [$watch_count_active]
Expected: a value greater than <1>
but: <1> was equal to <1>
at __randomizedtesting.SeedInfo.seed([21A43F67890E43A9:A9F000BD27F22E51]:0)

@rjernst added the Team:Data Management label May 4, 2020
@benwtrent
Member

@martijnvg @jakelandis new build failure: https://gradle-enterprise.elastic.co/s/y54f4tptbqszc

Here are the logs.
Re-muting the tests.
5532.tar.gz

benwtrent added a commit that referenced this issue May 11, 2020
muting watcher yaml tests. 

See issue #53177
@jakelandis
Contributor

Based on the last failure, the logs show:

org.elasticsearch.smoketest.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT > test {yaml=watcher/usage/10_basic/Test watcher usage stats output} FAILED	
    java.lang.AssertionError: Failure at [watcher/usage/10_basic:48]: field [watcher.count.active] is not greater than [$watch_count_active]	
    Expected: a value greater than <1>	
         but: <1> was equal to <1>
  [2020-05-11T19:26:51,279][INFO ][o.e.s.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT] [test] expected 0 active watches, but got [1], deleting watcher indices again	
  [2020-05-11T19:26:51,890][INFO ][o.e.s.SmokeTestWatcherWithSecurityClientYamlTestSuiteIT] [test] Stash dump on test failure [{	
    "stash" : {	
      "watch_count_active" : 1,

It seems that on a fresh start the watch_count is 1. So either the stat count is not getting updated fast enough (e.g. the new watch didn't seem to be counted), or the prior watch had not yet been deleted (from the usage API's point of view) at the first call, but had been deleted by the time of the second call. I suspect the second scenario based on the logs above.

@jakelandis
Contributor

Here is the server log with some context:

[2020-05-11T17:26:50,100][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [NOixgXhHqnFMzEwLuZaY] to trigger service
[2020-05-11T17:26:50,119][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] finished [NOixgXhHqnFMzEwLuZaY]/[NOixgXhHqnFMzEwLuZaY_c5458308-4b38-4839-bd92-b8290a5103ef-2020-05-11T17:26:50.075109Z]
[2020-05-11T17:26:50,167][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [.watcher-history-11-2020.05.11/Q3WBHNaTT--vA6TCFZ_-2w] deleting index
[2020-05-11T17:26:50,250][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [ilm-history-2-000001/wU8QANvgQeOUlaxhw7w2VA] deleting index
[2020-05-11T17:26:50,251][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [index_not_allowed_to_read/CZ5c8XiPTGeadyJoO9iK5w] deleting index
[2020-05-11T17:26:50,251][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [.triggered_watches/4KLBpbx_TMCP0_dbekx-NQ] deleting index
[2020-05-11T17:26:50,251][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [my_test_index/_wVkq3PoSySN9Vvjm957gw] deleting index
[2020-05-11T17:26:50,366][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] removing template [ilm-history]
[2020-05-11T17:26:50,411][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [ilm-history] for index patterns [ilm-history-2*]
[2020-05-11T17:26:51,292][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates []
[2020-05-11T17:26:51,302][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [index_not_allowed_to_read] creating index, cause [auto(bulk api)], templates [], shards [1]/[1], mappings []
[2020-05-11T17:26:51,441][INFO ][o.e.c.m.MetadataMappingService] [integTest-0] [index_not_allowed_to_read/lcZSU9egRQeS8OVpxpfmEQ] create_mapping
[2020-05-11T17:26:51,579][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [integTest-0] triggered job [NOixgXhHqnFMzEwLuZaY] at [2020-05-11T17:26:51.579Z] (scheduled time was [2020-05-11T17:26:51.101Z])
[2020-05-11T17:26:51,580][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] watcher execution service paused, not processing [1] events
[2020-05-11T17:26:51,662][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [.watches]
[2020-05-11T17:26:51,673][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.watches] creating index, cause [auto(bulk api)], templates [.watches], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:51,791][INFO ][o.e.x.w.WatcherService   ] [integTest-0] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-05-11T17:26:51,801][DEBUG][o.e.x.w.WatcherService   ] [integTest-0] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-05-11T17:26:51,808][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [usage_stats_watch] to trigger service
[2020-05-11T17:26:51,968][INFO ][o.e.x.w.WatcherService   ] [integTest-0] stopping watch service, reason [watcher manually marked to shutdown by cluster state update]
[2020-05-11T17:26:51,969][INFO ][o.e.x.w.WatcherLifeCycleService] [integTest-0] watcher has stopped
[2020-05-11T17:26:51,986][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [.watches/fJA7O7GEREq12I8un3F7kQ] deleting index
[2020-05-11T17:26:52,084][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [index_not_allowed_to_read/lcZSU9egRQeS8OVpxpfmEQ] deleting index
[2020-05-11T17:26:52,161][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] removing template [ilm-history]
[2020-05-11T17:26:52,215][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [ilm-history] for index patterns [ilm-history-2*]
[2020-05-11T17:26:52,437][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates []
[2020-05-11T17:26:52,445][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [index_not_allowed_to_read] creating index, cause [auto(bulk api)], templates [], shards [1]/[1], mappings []
[2020-05-11T17:26:52,574][INFO ][o.e.c.m.MetadataMappingService] [integTest-0] [index_not_allowed_to_read/v19BwETJQ1i0B4Pjqa94sA] create_mapping
[2020-05-11T17:26:52,629][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates []
[2020-05-11T17:26:52,636][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [my_test_index] creating index, cause [auto(bulk api)], templates [], shards [1]/[1], mappings []
[2020-05-11T17:26:52,783][INFO ][o.e.c.m.MetadataMappingService] [integTest-0] [my_test_index/hdOuy235Qf2O-K45jhI99A] create_mapping
[2020-05-11T17:26:52,925][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [.watches]
[2020-05-11T17:26:52,935][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.watches] creating index, cause [auto(bulk api)], templates [.watches], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:53,027][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [ilm-history]
[2020-05-11T17:26:53,037][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [ilm-history-2-000001] creating index, cause [api], templates [ilm-history], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:53,130][INFO ][o.e.x.w.WatcherService   ] [integTest-0] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-05-11T17:26:53,143][DEBUG][o.e.x.w.WatcherService   ] [integTest-0] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-05-11T17:26:53,151][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [my_watch] to trigger service

It looks like we still have start/stop issues. Specifically:

[2020-05-11T17:26:51,579][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [integTest-0] triggered job [NOixgXhHqnFMzEwLuZaY] at [2020-05-11T17:26:51.579Z] (scheduled time was [2020-05-11T17:26:51.101Z])
[2020-05-11T17:26:51,580][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] watcher execution service paused, not processing [1] events
[2020-05-11T17:26:51,662][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [.watches]
[2020-05-11T17:26:51,673][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.watches] creating index, cause [auto(bulk api)], templates [.watches], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:51,791][INFO ][o.e.x.w.WatcherService   ] [integTest-0] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-05-11T17:26:51,801][DEBUG][o.e.x.w.WatcherService   ] [integTest-0] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-05-11T17:26:51,808][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [usage_stats_watch] to trigger service
[2020-05-11T17:26:51,968][INFO ][o.e.x.w.WatcherService   ] [integTest-0] stopping watch service, reason [watcher manually marked to shutdown by cluster state update]
[2020-05-11T17:26:51,969][INFO ][o.e.x.w.WatcherLifeCycleService] [integTest-0] watcher has stopped

It triggered a job, paused execution, then created the .watches index (maybe due to the watch itself updating the .watches index?), which causes Watcher to restart in the background?

@jakelandis
Contributor

I think I see what is happening here. When Watcher executes a watch it updates the .watches index, and on 'postIndex' to .watches it adds that watch to the trigger service. The trigger service does not have any notion of started/stopped state, so when we shut down Watcher it can be in a stopping or stopped state (which clears out the trigger service), but the trigger service keeps running, and the update to .watches adds that watch back to the trigger service. So watches can actually be triggered via the trigger service even when the state is stopped. Subsequent watches don't actually run thanks to the pause check (watcher execution service paused, not processing [1] events), but adding the watch back to the trigger service throws off the watch_count, which is sourced from the trigger service.

I am still exploring the fixes...
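
To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of guard being considered (hypothetical names and simplified signatures, not the actual WatcherIndexingListener code): skip the trigger-service registration on postIndex while Watcher is stopping or stopped, and let the watch be re-read from .watches on the next start. The commit referenced later in this thread describes exactly this change.

// Hypothetical illustration; the real code lives in WatcherIndexingListener and
// uses the x-pack WatcherState and TriggerService types.
enum WatcherState { STARTED, STOPPING, STOPPED }

interface TriggerService {
    void add(String watchId);
}

class WatchesPostIndexGuard {
    private final TriggerService triggerService;
    private final java.util.function.Supplier<WatcherState> watcherState;

    WatchesPostIndexGuard(TriggerService triggerService,
                          java.util.function.Supplier<WatcherState> watcherState) {
        this.triggerService = triggerService;
        this.watcherState = watcherState;
    }

    // Called after a document has been indexed into .watches (simplified).
    void postIndex(String watchId, boolean triggeredLocally) {
        WatcherState state = watcherState.get();
        if (state == WatcherState.STOPPING || state == WatcherState.STOPPED) {
            // Watcher is shutting down: don't re-register the watch with the
            // trigger service (and hence the stats); it will be re-read from
            // the .watches index when Watcher starts again.
            return;
        }
        if (triggeredLocally) {
            triggerService.add(watchId);
        }
    }
}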

my notes

[2020-05-11T17:26:47,638][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.watches] creating index, cause [auto(bulk api)], templates [.watches], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:47,761][INFO ][o.e.x.w.WatcherService   ] [integTest-0] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-05-11T17:26:47,775][DEBUG][o.e.x.w.WatcherService   ] [integTest-0] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-05-11T17:26:47,777][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [NOixgXhHqnFMzEwLuZaY] to trigger service
[2020-05-11T17:26:49,069][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [integTest-0] triggered job [NOixgXhHqnFMzEwLuZaY] at [2020-05-11T17:26:49.069Z] (scheduled time was [2020-05-11T17:26:48.777Z])
[2020-05-11T17:26:49,074][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [.triggered_watches]
[2020-05-11T17:26:49,083][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.triggered_watches] creating index, cause [auto(bulk api)], templates [.triggered_watches], shards [1]/[1], mappings [_doc]
[2020-05-11T17:26:49,085][INFO ][o.e.c.r.a.AllocationService] [integTest-0] updating number_of_replicas to [0] for indices [.triggered_watches]
[2020-05-11T17:26:49,220][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] executing watch [NOixgXhHqnFMzEwLuZaY]
[2020-05-11T17:26:49,225][DEBUG][o.e.x.w.i.s.ExecutableSearchInput] [integTest-0] [NOixgXhHqnFMzEwLuZaY_03681829-7911-4825-b112-247d3741e8b7-2020-05-11T17:26:49.070095Z] found [1] hits
[2020-05-11T17:26:49,226][DEBUG][o.e.x.w.i.s.ExecutableSearchInput] [integTest-0] [NOixgXhHqnFMzEwLuZaY_03681829-7911-4825-b112-247d3741e8b7-2020-05-11T17:26:49.070095Z] hit [{value=15}]

[2020-05-11T17:26:49,226][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [integTest-0] successfully ran NOixgXhHqnFMzEwLuZaYto test for search input <--- testSearchInputHasPermissions is running - note the typo of no space - the interval should be every 4s ... but the watch gets added again right away. 

[2020-05-11T17:26:49,241][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [NOixgXhHqnFMzEwLuZaY] to trigger service <-- postIndex due to updating the .watches index?? (pretty sure yes)

[2020-05-11T17:26:49,247][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [.watch-history-11]
[2020-05-11T17:26:49,259][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.watcher-history-11-2020.05.11] creating index, cause [auto(bulk api)], templates [.watch-history-11], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:49,400][INFO ][o.e.c.m.MetadataMappingService] [integTest-0] [.watcher-history-11-2020.05.11/Q3WBHNaTT--vA6TCFZ_-2w] update_mapping [_doc]
[2020-05-11T17:26:49,455][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] finished [NOixgXhHqnFMzEwLuZaY]/[NOixgXhHqnFMzEwLuZaY_03681829-7911-4825-b112-247d3741e8b7-2020-05-11T17:26:49.070095Z]


[2020-05-11T17:26:50,074][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [integTest-0] triggered job [NOixgXhHqnFMzEwLuZaY] at [2020-05-11T17:26:50.074Z] (scheduled time was [2020-05-11T17:26:49.777Z]) <-- job is triggered!
[2020-05-11T17:26:50,080][INFO ][o.e.x.w.WatcherService   ] [integTest-0] stopping watch service, reason [watcher manually marked to shutdown by cluster state update]
[2020-05-11T17:26:50,080][INFO ][o.e.x.w.WatcherLifeCycleService] [integTest-0] watcher has stopped  <-- we stopped watcher, the test thinks it has stopped ... but it hasn't! ... there is still a rogue triggered job!

[2020-05-11T17:26:50,087][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] executing watch [NOixgXhHqnFMzEwLuZaY]
[2020-05-11T17:26:50,092][DEBUG][o.e.x.w.i.s.ExecutableSearchInput] [integTest-0] [NOixgXhHqnFMzEwLuZaY_c5458308-4b38-4839-bd92-b8290a5103ef-2020-05-11T17:26:50.075109Z] found [1] hits
[2020-05-11T17:26:50,092][DEBUG][o.e.x.w.i.s.ExecutableSearchInput] [integTest-0] [NOixgXhHqnFMzEwLuZaY_c5458308-4b38-4839-bd92-b8290a5103ef-2020-05-11T17:26:50.075109Z] hit [{value=15}]
[2020-05-11T17:26:50,095][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [.watches/Y3jGo4ZARduQMwW0bbf7Uw] deleting index

[2020-05-11T17:26:50,100][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [NOixgXhHqnFMzEwLuZaY] to trigger service  <-- adding it back due to the .watches update, which effectively restarts Watcher

[2020-05-11T17:26:50,119][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] finished [NOixgXhHqnFMzEwLuZaY]/[NOixgXhHqnFMzEwLuZaY_c5458308-4b38-4839-bd92-b8290a5103ef-2020-05-11T17:26:50.075109Z]
[2020-05-11T17:26:50,167][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [.watcher-history-11-2020.05.11/Q3WBHNaTT--vA6TCFZ_-2w] deleting index
[2020-05-11T17:26:50,250][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [ilm-history-2-000001/wU8QANvgQeOUlaxhw7w2VA] deleting index
[2020-05-11T17:26:50,251][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [index_not_allowed_to_read/CZ5c8XiPTGeadyJoO9iK5w] deleting index
[2020-05-11T17:26:50,251][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [.triggered_watches/4KLBpbx_TMCP0_dbekx-NQ] deleting index
[2020-05-11T17:26:50,251][INFO ][o.e.c.m.MetadataDeleteIndexService] [integTest-0] [my_test_index/_wVkq3PoSySN9Vvjm957gw] deleting index
[2020-05-11T17:26:50,366][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] removing template [ilm-history]
[2020-05-11T17:26:50,411][INFO ][o.e.c.m.MetadataIndexTemplateService] [integTest-0] adding template [ilm-history] for index patterns [ilm-history-2*]
[2020-05-11T17:26:51,292][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates []
[2020-05-11T17:26:51,302][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [index_not_allowed_to_read] creating index, cause [auto(bulk api)], templates [], shards [1]/[1], mappings []
[2020-05-11T17:26:51,441][INFO ][o.e.c.m.MetadataMappingService] [integTest-0] [index_not_allowed_to_read/lcZSU9egRQeS8OVpxpfmEQ] create_mapping
[2020-05-11T17:26:51,579][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [integTest-0] triggered job [NOixgXhHqnFMzEwLuZaY] at [2020-05-11T17:26:51.579Z] (scheduled time was [2020-05-11T17:26:51.101Z])
[2020-05-11T17:26:51,580][DEBUG][o.e.x.w.e.ExecutionService] [integTest-0] watcher execution service paused, not processing [1] events
[2020-05-11T17:26:51,662][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] applying create index request using v1 templates [.watches]
[2020-05-11T17:26:51,673][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.watches] creating index, cause [auto(bulk api)], templates [.watches], shards [1]/[0], mappings [_doc]
[2020-05-11T17:26:51,791][INFO ][o.e.x.w.WatcherService   ] [integTest-0] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-05-11T17:26:51,801][DEBUG][o.e.x.w.WatcherService   ] [integTest-0] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-05-11T17:26:51,808][DEBUG][o.e.x.w.WatcherIndexingListener] [integTest-0] adding watch [usage_stats_watch] to trigger service
[2020-05-11T17:26:51,968][INFO ][o.e.x.w.WatcherService   ] [integTest-0] stopping watch service, reason [watcher manually marked to shutdown by cluster state update]
[2020-05-11T17:26:51,969][INFO ][o.e.x.w.WatcherLifeCycleService] [integTest-0] watcher has stopped

jakelandis added a commit that referenced this issue May 12, 2020
Watcher adds watches to the trigger service on the postIndex action
for the .watches index. This has the (intentional) side effect of also
adding the watches to the stats. The tests rely on these stats for their
assertions. The tests also start and stop Watcher between each test for
a clean slate.

When Watcher executes it updates the .watches index, and upon this update
it goes through the postIndex method and ends up adding that watch to the
trigger service (and stats). Functionally this is not a problem if Watcher
is stopping or stopped, since Watcher is also paused and will not execute
the watch. However, with specific timing, the expectation of a clean slate
can cause issues with the test assertions against the stats.

This commit ensures that the postIndex action only adds to the trigger service
if the Watcher state is not stopping or stopped. When started back up it will
re-read the .watches index.

This commit also un-mutes the tests related to #53177 and #56534
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue May 12, 2020
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue May 12, 2020
jakelandis added a commit to jakelandis/elasticsearch that referenced this issue May 12, 2020
@jakelandis
Contributor

I am hopeful that #56556 stabilizes these tests.

jakelandis added a commit that referenced this issue May 12, 2020
jakelandis added a commit that referenced this issue May 12, 2020
jakelandis added a commit that referenced this issue May 12, 2020
@jakelandis
Contributor

jakelandis commented Jun 2, 2020

No failures since the introduction of #56556 (it's hard to tell from the screenshot, but that failure is from May 11, just prior to the tests getting muted).

Closing this issue.

[build-stats screenshots]
