Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] GeoIpDownloaderIT testStartWithNoDatabases failing #79074

Closed
jrodewig opened this issue Oct 13, 2021 · 18 comments · Fixed by #79131
Closed

[CI] GeoIpDownloaderIT testStartWithNoDatabases failing #79074

jrodewig opened this issue Oct 13, 2021 · 18 comments · Fixed by #79131
Assignees
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@jrodewig
Copy link
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/uhhnoantetlfq/tests/:modules:ingest-geoip:internalClusterTest/org.elasticsearch.ingest.geoip.GeoIpDownloaderIT/testStartWithNoDatabases

Reproduction line:
./gradlew ':modules:ingest-geoip:internalClusterTest' --tests "org.elasticsearch.ingest.geoip.GeoIpDownloaderIT.testStartWithNoDatabases" -Dtests.seed=976EEDC6532D0E7E -Dtests.locale=es-HN -Dtests.timezone=America/Thunder_Bay -Druntime.java=11

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.ingest.geoip.GeoIpDownloaderIT&tests.test=testStartWithNoDatabases

Failure excerpt:

java.lang.AssertionError: 
Expected: map containing ["tags"-><[_geoip_database_unavailable_GeoLite2-City.mmdb, _geoip_database_unavailable_GeoLite2-Country.mmdb, _geoip_database_unavailable_GeoLite2-ASN.mmdb]>]
     but: map was [<_id=_id>, <ip-asn={ip=89.160.20.128, organization_name=Bredband2 AB, asn=29518, network=89.160.0.0/17}>, <_index=my-index>, <ip-city={continent_name=Europe, region_iso_code=SE-AB, city_name=Tumba, country_iso_code=SE, country_name=Sweden, region_name=Stockholm, location={lon=17.8167, lat=59.2}}>, <ip-country={continent_name=Europe, country_iso_code=SE, country_name=Sweden}>, <ip=89.160.20.128>]

  at __randomizedtesting.SeedInfo.seed([976EEDC6532D0E7E:5AD1C00E49EC69D]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.ingest.geoip.GeoIpDownloaderIT.testStartWithNoDatabases(GeoIpDownloaderIT.java:324)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

@jrodewig jrodewig added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >test-failure Triaged test failures from CI labels Oct 13, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Oct 13, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@cbuescher
Copy link
Member

Can we mute this please? It fails one of my PRs with a pretty high frequency.

@jrodewig
Copy link
Contributor Author

Muted with 896bb85.

Sorry about that. I hadn't seen any other failures, and it didn't reproduce locally.

@martijnvg martijnvg self-assigned this Oct 13, 2021
@martijnvg
Copy link
Member

Sorry about the noise. I recently added this test in a change that went only in master. I will fix this.

@danhermann
Copy link
Contributor

Might this other test failure in GeoIpDownloaderIT be related? https://gradle-enterprise.elastic.co/s/neooukkmmjfdo

@martijnvg
Copy link
Member

@danhermann I think so. Can you mute this test?

@danhermann
Copy link
Contributor

@danhermann I think so. Can you mute this test?

Yes, muting in #79089

@martijnvg
Copy link
Member

thanks!

@martijnvg
Copy link
Member

I was able to reproduce this failure. Another test adds test database in the config directory of all nodes. The test cluster is reused, but these databases were not cleaned up by this tests, which broke the assumption of testStartWithNoDatabases () test (which is there are no databases at startup).

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Oct 14, 2021
In the case a database couldn't be loaded, the geoip processor factory
checks whether any databases are available and then returns a processor
implementation that tags documents with the fact that required database
wasn't available. The GeoIpProcessor itself also loads the database, but
in case a database can't be loaded then it always fails with resource
missing exception. The GeoIpProcessor is modified in this change to
also check whether any database is available and in that case tag
documents instead of failing.

GeoIpDownloaderIT improvements:
* The `testUseGeoIpProcessorWithDownloadedDBs()` was adding databases to config dirs,
  but not cleaning it up. Which broke assumptions in others in this suite, because
  the test cluster is reused.
* Use the geoip stats api after each test to wait for a clean state, which means
  wait for database downloader to be disabled and all database files to be removed
  on all ingest nodes.
* Don't use `IngestDocument#getFieldValue(...)` in test code surrounded by `assertBusy(...)`.
  If a field isn't there an illegal state exception is thrown, which isn't caught by
  `assertBusy(...)`. Only assertion errors are handled.

Closes elastic#79074
martijnvg added a commit that referenced this issue Oct 14, 2021
…g. (#79131)

In the case a database couldn't be loaded, the geoip processor factory
checks whether any databases are available and then returns a processor
implementation that tags documents with the fact that required database
wasn't available. The GeoIpProcessor itself also loads the database, but
in case a database can't be loaded then it always fails with resource
missing exception. The GeoIpProcessor is modified in this change to
also check whether any database is available and in that case tag
documents instead of failing.

GeoIpDownloaderIT improvements:
* The `testUseGeoIpProcessorWithDownloadedDBs()` was adding databases to config dirs,
  but not cleaning it up. Which broke assumptions in others in this suite, because
  the test cluster is reused.
* Use the geoip stats api after each test to wait for a clean state, which means
  wait for database downloader to be disabled and all database files to be removed
  on all ingest nodes.
* Don't use `IngestDocument#getFieldValue(...)` in test code surrounded by `assertBusy(...)`.
  If a field isn't there an illegal state exception is thrown, which isn't caught by
  `assertBusy(...)`. Only assertion errors are handled.

Closes #79074
@fcofdez
Copy link
Contributor

fcofdez commented Oct 14, 2021

I'm not sure if the test is fixed, I've just got a failure in https://gradle-enterprise.elastic.co/s/6fisos7fmdfc4

@fcofdez
Copy link
Contributor

fcofdez commented Oct 14, 2021

I'm reopening this since it seems like the issue still persists

@fcofdez fcofdez reopened this Oct 14, 2021
@martijnvg
Copy link
Member

I'm looking into this new failure. (i was pretty sure the test was fixed... i guess I was wrong!)

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Oct 14, 2021
martijnvg added a commit that referenced this issue Oct 15, 2021
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Oct 15, 2021
If builtin database can't be loaded then assume it will be available soon
via database geoip downloading mechanism. So instead of returning a
config error, returns a geoip processor impl that tags documents with
the fact that a builtin db isn't yet avaialble.

Relates to elastic#79074
martijnvg added a commit that referenced this issue Oct 15, 2021
* Try to clean config databases after each test (instead of 1).
* Improve assertions in testUseGeoIpProcessorWithDownloadedDBs()
* Wait for managed databases to be deleted in testUseGeoIpProcessorWithDownloadedDBs().

Relates to #79074
@DaveCTurner
Copy link
Contributor

I had a failure of GeoIpDownloaderCliIT#testStartWithNoDatabases in https://gradle-enterprise.elastic.co/s/6ln3sxuv2ohhm, I'm assuming it's related to this rather than a separate thing so just linking it here.

@martijnvg
Copy link
Member

I'm unable to reproduce this failure locally. I'm going to re-enable this test with geoip debug logging enabled. When it fails there should be more data to investigate why the pipeline with geoip processor isn't reloaded after the databases are available. If this test fails without me noticing then please mute this test and share the Gradle link.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Oct 27, 2021
martijnvg added a commit that referenced this issue Oct 27, 2021
…79240)

If builtin database can't be loaded then assume it will be available soon
via database geoip downloading mechanism. So instead of returning a
config error, returns a geoip processor impl that tags documents with
the fact that a builtin db isn't yet avaialble.

Relates to #79074
@DaveCTurner
Copy link
Contributor

Failed for me on a PR build here: https://gradle-enterprise.elastic.co/s/miosturet4dp4

@martijnvg
Copy link
Member

I looked into this failure and this instance the simulate pipeline api was repeatedly executed on a coordinating only node. The databases are only loaded on nodes with ingest role, this causes the geoip lookup never to happen. The simulate pipeline api always executes on the first node it hits, I think we should change that (redirect to a node with ingest role instead, or wait until we drop the ingest role...). For now, I will adjust the test to not hit a coordinating only node.

Note that this isn't an issue for bulk/index api (if an ingest pipeline is used the these APIs always redirect to a node with an ingest role).

@martijnvg
Copy link
Member

martijnvg commented Nov 1, 2021

Muting this test now. My latest tweak to the test, apparently made things worse.
UPDATE: I was confused. Gradle Enterprise showed me the failures from my local machine. So no actual failure happened on regular and PR CI runs. I will unmute...

martijnvg added a commit that referenced this issue Nov 1, 2021
martijnvg added a commit that referenced this issue Nov 2, 2021
@martijnvg
Copy link
Member

The test hasn't failed, since 4db5fc5 has been pushed. Closing it for now, if this test fails again then this issue can re-opened or a new issue can be opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants