[Meta] GeoIPv2 #68920

Closed
15 tasks done
probakowski opened this issue Feb 11, 2021 · 4 comments
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), Meta, release highlight, Team:Data Management (Meta label for data/management team), v7.14.0, v8.0.0-alpha1

Comments

@probakowski
Contributor

probakowski commented Feb 11, 2021

This is a meta issue to track the progress of the GeoIPv2 work. These are the items currently identified that we need to complete for GA:

@probakowski added the Meta, :Data Management/Ingest Node, and v8.0.0 labels Feb 11, 2021
@elasticmachine added the Team:Data Management label Feb 11, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

probakowski added a commit that referenced this issue Feb 23, 2021
This change adds a component that downloads new GeoIP databases from the infra service.
New databases are downloaded in chunks and stored in the .geoip_databases index.
Downloads are verified against the MD5 checksum provided by the server.
The current state of all stored databases is kept in the cluster state, as part of the persistent task state.

Relates to #68920
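
The chunk-and-verify idea described in this commit message can be sketched roughly as follows. This is a minimal illustration only, not the actual GeoIpDownloader code: the chunk size, the function names, and the use of an in-memory list in place of the .geoip_databases index are all assumptions made for the example.

```python
import hashlib
from typing import List

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the example is easy to follow


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Split a downloaded payload into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def verify_md5(chunks: List[bytes], expected_md5: str) -> bool:
    """Hash the chunks in order and compare with the checksum reported by the server."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest() == expected_md5


payload = b"example database bytes"
expected = hashlib.md5(payload).hexdigest()  # stands in for the server-provided checksum
chunks = split_into_chunks(payload)          # stands in for documents stored in the index
```

Hashing chunk by chunk means the whole database never has to be held in memory to check it, and a missing or reordered chunk shows up as a checksum mismatch.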
probakowski added a commit that referenced this issue Feb 24, 2021
This change adds a query parameter confirming that we accept the ToS of the GeoIP database service provided by Infra.
It also changes the integration test to use a lower timeout when using the local fixture.

Relates to #68920
martijnvg added a commit that referenced this issue Mar 4, 2021
…ownloader (#69540)

This component is responsible for making the databases maintained by GeoIpDownloader
available to ingest processors.

It also provides a lookup mechanism for geoip processors, with fallback to {@link LocalDatabases}.
All databases are downloaded into a geoip tmp directory, which is created at node startup.

The following high-level steps are executed after each cluster state update:
1) Check which databases are available in {@link GeoIpTaskState},
   which is part of the geoip downloader persistent task.
2) For each database, check whether it has changed (by comparing
   the local and remote md5 hashes) or is missing locally.
3) For each database identified in step 2, start downloading the database
   chunks. Each chunk is appended to a tmp file (inside the geoip tmp dir) and,
   after all chunks have been downloaded, the database is uncompressed and
   renamed to the final filename. After this the database is loaded and,
   if there is an old instance of this database, that instance is closed.
4) Clean up locally loaded databases that are no longer mentioned in {@link GeoIpTaskState}.

Relates to #68920
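
The four steps above can be sketched as follows. This is a simplified model under stated assumptions, not the DatabaseRegistry implementation: the task state and the local state are modelled as plain name-to-md5 dicts, the compression is assumed to be gzip, and the function names and temporary file suffixes are hypothetical.

```python
import gzip
import os
import tempfile


def plan_updates(task_state: dict, local_md5: dict):
    """Steps 1, 2 and 4: decide which databases to (re)download and which to drop."""
    # A database is downloaded when it is missing locally or its md5 differs.
    to_download = sorted(
        name for name, md5 in task_state.items() if local_md5.get(name) != md5
    )
    # A database is removed when the task state no longer mentions it.
    to_remove = sorted(name for name in local_md5 if name not in task_state)
    return to_download, to_remove


def install_database(tmp_dir: str, name: str, chunks) -> str:
    """Step 3: append chunks to a tmp file, decompress, then rename into place."""
    compressed_path = os.path.join(tmp_dir, name + ".tmp.gz")
    with open(compressed_path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
    decompressed_path = os.path.join(tmp_dir, name + ".tmp")
    with gzip.open(compressed_path, "rb") as src, open(decompressed_path, "wb") as dst:
        dst.write(src.read())
    os.remove(compressed_path)
    final_path = os.path.join(tmp_dir, name)
    os.replace(decompressed_path, final_path)  # readers only ever see a complete file
    return final_path


task_state = {"GeoLite2-City.mmdb": "aaa", "GeoLite2-ASN.mmdb": "bbb"}
local = {"GeoLite2-City.mmdb": "aaa", "GeoLite2-ASN.mmdb": "old", "Stale.mmdb": "ccc"}
to_download, to_remove = plan_updates(task_state, local)
```

Writing to a temporary name and renaming only at the end means a partially downloaded file never sits under the final filename, which is presumably why the commit describes the rename as the last step before loading.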
martijnvg added a commit that referenced this issue Mar 10, 2021
…ownloader (#69971)

Backport of #69540 to 7.x branch.

Relates to #68920

Other cherry-picked commits:

* Fix ReloadingDatabasesWhilePerformingGeoLookupsIT (#70163)

Wait for ingest threads to stop using the DatabaseReaderLazyLoader, so that during the next run the db update thread doesn't try to remove the db again (because the file hasn't yet been deleted).

Also delete the tmp dirs this test creates at the end of the test, so that repeated runs don't accumulate many directories with database files.

Closes #69980

* Fix clean up of old entries in DatabaseRegistry.initialize (#70135)

This change switches the clean-up in DatabaseRegistry.initialize from Files.walk and stream operations to Files.walkFileTree, which can be made more robust in case of errors.

* Fix DatabaseRegistryTests (#70180)

This test predefined the expected md5 hashes in constants, which matched the hashes produced with java15.
However, java16 creates different md5 hashes, so the expected md5 hashes didn't match
the actual md5 hashes, which caused tests in this test suite to fail (when running
with java16 only).

The test now generates the expected md5 hash during the test instead of using predefined constants.

Closes #69986

* Fix GeoIpDownloaderIT#testUseGeoIpProcessorWithDownloadedDBs(...) test (#70215)

The test failure looks legit, because there is a possibility that the same database
was downloaded twice. See the added comment in the DatabaseRegistry class.

Relates to #69972

* Muted GeoIpDownloaderIT#testUseGeoIpProcessorWithDownloadedDBs(...) test,
see #69972

Co-authored-by: Przemko Robakowski <przemko.robakowski@elastic.co>
martijnvg added a commit that referenced this issue Mar 17, 2021
This change adjusts where the geoip tmp directory is created
to avoid issues when running multiple nodes on the same machine.

In the java tmp dir, a 'geoip-databases' directory is created and
directly under it a directory named after the node id is created.
This allows safely running multiple nodes on the same machine (this
happens mainly during tests).

Closes #69972
Relates to #68920
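
The directory layout described in this commit can be illustrated with a short sketch. It uses Python's temp dir as a stand-in for the java tmp dir; the function name and the optional `base_tmp` parameter are hypothetical and exist only to make the example testable.

```python
import os
import tempfile


def geoip_tmp_dir(node_id, base_tmp=None):
    """Return (and create) <tmp>/geoip-databases/<node_id>."""
    base = base_tmp if base_tmp is not None else tempfile.gettempdir()
    path = os.path.join(base, "geoip-databases", node_id)
    os.makedirs(path, exist_ok=True)  # safe to call again for the same node
    return path
```

Because each node id gets its own subdirectory, two nodes sharing a machine (and therefore a tmp dir) never write database files into the same location.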
martijnvg added a commit that referenced this issue Mar 17, 2021
Backport #70462 to 7.x branch.

Closes #69972
Relates to #68920
probakowski added a commit that referenced this issue Mar 23, 2021
This change adds a _geoip/stats endpoint that can be used to collect basic data about the geoip downloader (successful, failed and skipped downloads, current db count and total time spent downloading).
It also fixes missing or wrong origins on clients, which would break if used with security.

Relates to #68920
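
As a rough illustration of the counters listed in this commit message, the sketch below tracks the same five values. The class, its field names, and its methods are hypothetical and are not the response format of the stats endpoint.

```python
from dataclasses import dataclass


@dataclass
class DownloaderStats:
    # Field names are illustrative only; they mirror the commit's wording.
    successful_downloads: int = 0
    failed_downloads: int = 0
    skipped_downloads: int = 0
    databases_count: int = 0
    total_download_time_ms: int = 0

    def record_success(self, elapsed_ms: int, databases_count: int) -> None:
        """A database was downloaded and stored; update time and db count."""
        self.successful_downloads += 1
        self.total_download_time_ms += elapsed_ms
        self.databases_count = databases_count

    def record_failure(self) -> None:
        """A download attempt failed."""
        self.failed_downloads += 1

    def record_skip(self) -> None:
        """The database was already up to date, so nothing was downloaded."""
        self.skipped_downloads += 1


stats = DownloaderStats()
stats.record_success(elapsed_ms=120, databases_count=1)
stats.record_skip()
stats.record_failure()
```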
probakowski added a commit that referenced this issue Apr 15, 2021
* Enable GeoIP downloader by default (#71505)

This change enables the GeoIP downloader by default.
It removes the feature flag but adds a flag that tests use to disable the downloader again (as we don't want to hammer the GeoIP database service with every test cluster we spin up).

Relates to #68920

* fix compilation

* spotless

* packaging tests

* disableGeoIpDownloader

* fix packaging
probakowski added a commit that referenced this issue Apr 15, 2021
* Update GeoIP processor documentation (#71211)

This PR adds documentation for the GeoIPv2 auto-update feature.
It also renames the related settings from geoip.downloader.* to ingest.geoip.downloader.* so they follow the same convention as the current setting.

Relates to #68920

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
@csoulios added v7.13.2 and removed v7.13.1 labels Jun 2, 2021
@joegallo
Contributor

@probakowski should this be relabeled to v7.14.0?

@probakowski
Contributor Author

it should, I've updated it

@probakowski
Contributor Author

As all work for 7.x is done and there's just 1 task left, I'll close this issue as done

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Oct 7, 2021
adjusted tests. Kept the `geolite2-databases` dependency for
tests only.

Relates to elastic#68920
martijnvg added a commit that referenced this issue Oct 13, 2021
* Adjusted integration tests to use geoip test fixture or to use test databases provided via config dirs (for qa module / docs).
* Kept the geolite2-databases dependency for most of the unit tests only.
* Made fallback_to_default_databases parameter on geoip processor a noop and emit deprecation warning upon using it.
* If no geoip databases are available yet to a node, the geoip processor factory returns a processor implementation that flags documents to indicate that databases are unavailable. This allows these documents to be reindexed later with a pipeline. These documents will have a tag string array field, which contains a string _geoip_database_unavailable_{database_name} for each missing database in a pipeline.
* Added reload pipeline capabilities in IngestService, so that when databases are available again on a node, pipelines with a geoip processor definition can be reloaded.

Relates to #68920
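
The tagging behaviour described in the fourth bullet can be sketched as follows. This is a minimal model, not the processor implementation: documents are plain dicts, the field name `tags` is an assumption (the commit message only says "a tag string array field"), and both function names are hypothetical.

```python
TAG_PREFIX = "_geoip_database_unavailable_"


def tag_unavailable(document: dict, database_name: str) -> dict:
    """Flag a document that could not be enriched because a database is missing."""
    tag = TAG_PREFIX + database_name
    tags = document.setdefault("tags", [])  # "tags" is an assumed field name
    if tag not in tags:
        tags.append(tag)
    return document


def needs_reindex(document: dict) -> bool:
    """True if the document carries at least one 'database unavailable' tag."""
    return any(t.startswith(TAG_PREFIX) for t in document.get("tags", []))


doc = tag_unavailable({"ip": "203.0.113.7"}, "GeoLite2-City.mmdb")
```

Tagging instead of failing lets the documents be indexed right away and found later, once the databases are available, so they can be reindexed through the pipeline.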