Skip to content

Conversation

@rubvs
Copy link
Contributor

@rubvs rubvs commented Aug 18, 2025

Closes #17603

This system test should fail if there are any tags in the indexed doc. This test ensure that if, for some reason, we cannot connect or download the geoIP database, the system tests will fail.

The systems test leverage docker volumes to link the geoip database, see docker-compose. Therefore, properly testing this PR requires some manual steps.

  • Either delete or comment the docker volume line linked above.
  • Ensure you are pulling a clean ES image. I typically like to keep my local clean, so I'll remove all containers and images:
# Remove all running container instances
> docker rm -f $(docker ps -a -q)

# Remove all built images
> docker rmi -f $(docker images -a -q)
  • Running make system-test should fail with the following error:
> make system-test

rum_test.go:94: unexpected tags field in document: [_geoip_database_unavailable_GeoLite2-City.mmdb]
  • Repeat, but with the ingest-geoip volume enabled in docker-compose.yml, then the system test will pass.

@github-actions
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Copy link
Contributor

mergify bot commented Aug 18, 2025

This pull request does not have a backport label. Could you fix it @rubvs? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-9./d is the label to automatically backport to the 9./d branch. /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@rubvs rubvs marked this pull request as ready for review August 18, 2025 20:47
@rubvs rubvs requested a review from a team as a code owner August 18, 2025 20:47
@rubvs rubvs requested a review from simitt August 18, 2025 20:47
],
"agent.version": [
"5.5.0"
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this changed ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, the client.geo expansion should not be removed.

func TestRUMGeoIpTags(t *testing.T) {
systemtest.CleanupElasticsearch(t)

// Ensure the geo-database does not get downloaded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something but I think the issue was about testing the geoip is correctly downloaded and applied. This PR seems to do the opposite 🤔


r := esapi.ClusterPutSettingsRequest{
Body: bytes.NewReader(b),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would this be set to false? The goal is to trigger the geoip downloader and have the relevant fields in the ingested events geoip enriched.
The tags field would only be set if an error happened during this download or enrichment, so we want to check that tags is empty.

@rubvs
Copy link
Contributor Author

rubvs commented Aug 20, 2025

@simitt, there is something flaky here.

The bind volume (ingest-geoip) in the docker-compose has been bugging me, since this means our tests are never downloading the geo db.

I've also noticed, the db does net get installed properly the first time I run the systems tests. Even with the cluster settings enabled for downloading. The second time I run make system-test, things work as expected.

After looking into this a bit, I see the original Issue, solved in PR, introduced the docker volume for exactly this flaky reason.

I'll have a deeper look at this tomorrow.

@simitt
Copy link
Contributor

simitt commented Aug 20, 2025

I've also noticed, the db does net get installed properly the first time I run the systems tests. Even with the cluster settings enabled for downloading. The second time I run make system-test, things work as expected.

Please dig a bit more into why this fails on the first try and succeeds on second. Is it a timing issue?

@rubvs
Copy link
Contributor Author

rubvs commented Aug 20, 2025

It doesn't seem like a timing issue from my debugging yesterday. There was a previous PR that would loop twice for this same reason: b5d52d7. I'll look at this today.

@rubvs
Copy link
Contributor Author

rubvs commented Aug 21, 2025

Quick update:

Yesterday

I looked into why downloading the GeoIp database is so inconsistent. I added the following download funtions:

func waitGeoIPDatabase(ctx context.Context) error {
	b, err := json.Marshal(map[string]any{
		"persistent": map[string]any{
			"ingest.geoip.downloader.enabled":        true,
			"ingest.geoip.downloader.eager.download": true,  // <- Crucial
		},
	})
	r := esapi.ClusterPutSettingsRequest{
		Body: bytes.NewReader(b),
	}
	res, err := r.Do(ctx, Elasticsearch)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	if res.IsError() {
		return fmt.Errorf("response err: %d", res.StatusCode)
	}

	var geoIpResponse struct {
		Databases []struct {
			Id       string `json:"id"`
			Database struct {
				Name string `json:"name"`
			} `json:"database"`
		} `json:"databases"`
	}

	first := true
	for {
		gr, err := Elasticsearch.Ingest.GetGeoipDatabase()
		if err != nil {
			return err
		}
		if err := json.NewDecoder(gr.Body).Decode(&geoIpResponse); err != nil {
			return err
		}
		if len(geoIpResponse.Databases) > 0 {
			return nil
		}
		if first {
			log.Printf("Waiting for geoIP database to be downloaded")
			first = false
		}
		time.Sleep(waitHealthyInterval)
	}
}

The critical part, is the addition if "ingest.geoip.downloader.eager.download": true. Basically, the cluster settings as defined in the code above states:

ingest.geoip.downloader.enabled translates to "it is permissible to download the files when you feel like it", and ingest.geoip.downloader.eager.download to "and that time is now".

With this setup, and a wait loop, we are able to consistently download the db. However, on my local machine this takes some time - around 20s.

Today

With the downloading taking some time, it's safe to assume this will lead to some networking issues in CI, hence the solution outlined in PR.

I have been looking into programmatically manipulating the volume to toggle the presence of the database. Updates will follow.

@axw
Copy link
Member

axw commented Aug 21, 2025

I have been looking into programmatically manipulating the volume to toggle the presence of the database. Updates will follow.

@rubvs another option to consider: maybe we can intentionally break the ingest processor configuration by updating the database_file configuration to something that doesn't exist. It's not exactly the same root cause, but similar enough IMO.

@rubvs
Copy link
Contributor Author

rubvs commented Aug 22, 2025

Created a PR with (un)mount the local database file instead of downloading: #18277

@axw, do you mean renaming the mounted database file?

@axw
Copy link
Member

axw commented Aug 22, 2025

@axw, do you mean renaming the mounted database file?

No, I mean modify the ingest pipeline. Your approach sounds fine.

@rubvs
Copy link
Contributor Author

rubvs commented Aug 22, 2025

@axw oh I see the database_file defined Here. Okay, let me think about this tomorrow.

@rubvs
Copy link
Contributor Author

rubvs commented Aug 27, 2025

Superseded by #18325

@rubvs rubvs closed this Aug 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add test for the absence of tags in apm documents

5 participants