Distributor accept multiple HA Tracker pairs in the same request #6278

eduardscaueru · 2024-10-20T12:59:59Z

What this PR does:
Added new implementation that makes the distributor accept multiple HA pairs (cluster, replica) in the same requets/batch. This can be enabled with a new flag, accept_mixed_ha_samples, an will take effect only if accept_ha_samples is set to true.

This implementation check every timeseries from the request if it has both cluster and replica labels. If yes, then it checks the KV store to see if it matches with the elected replica. If not, the current timeseries is discarded (not added to the validatedTimeseries) and the rest of the batch moves on.

It also ensures that when the cluster label is missing the replica label is not removed.

Which issue(s) this PR fixes:
Fixes #6256

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

SungJin1212 · 2024-10-22T11:34:10Z

@eduardscaueru
Thank you for your contribution.

For me, #6278 looks more good. I feel like it's more readable and it didn't change the function parameters.

Can you add an e2e test to make sure accept_mixed_ha_samples correctly works as a runtime config?

pkg/distributor/distributor.go

eduardscaueru · 2024-10-23T19:39:53Z

@SungJin1212 @friedrichg I added a e2e test for this feature, let me know if it is ok.

Regarding the other approach, PR 6279, I feel like it is more efficient in the scenario where multiple timeseries with the same HA pairs come in the same request. Mostly, if this happens, then the client cannot aggregate them and push all samples for each pair, and will have only one sample/datapoint per timeseries, therefore making a call to the KV store for all of them.

The precomputed map of valid HA pairs it is efficient since if the pair is already in the map, it won't make another call to the KV store and to check if the timeseries is valid just checks the map in O(1).

I know I said in the in the issue that this PR is more suitable, but I would incline more for PR 6279 given the scenario from above.

SungJin1212 · 2024-10-24T01:53:03Z

@eduardscaueru
The number of calls to checkSample seems to be the same, doesn't it? I could be wrong. The e2e test doesn't show that accept_mixed_ha_samples works as a runtime config.

eduardscaueru · 2024-10-24T07:28:49Z

@eduardscaueru The number of calls to checkSample seems to be the same, doesn't it? I could be wrong. The e2e test doesn't show that accept_mixed_ha_samples works as a runtime config.

@SungJin1212 The checkSample method will be called only one time for a unique pair. These unique pairs are computed with findAllHAPairs.

Sorry, I thought that runtime config would be by enabling the CLI flag for it. Could you please guide me on what needs to be done to enable it?

SungJin1212 · 2024-10-24T08:06:27Z

@eduardscaueru
I see. Thanks for the explanation, #6279 would be better in performance-wise. I'd like to hear other people's opinions.

The cortex watches a runtime config, it is applied to the cortex at the runtime periodically.
I thought the test should ensure a tenant can use a accept_mixed_ha_samples using the runtime config when the accept_mixed_ha_samples is set to false globally.

SungJin1212 · 2024-10-24T08:12:54Z

@eduardscaueru
I'm just wondering about a use case for this feature.

eduardscaueru · 2024-10-24T08:19:04Z

@eduardscaueru I'm just wondering about a use case for this feature.

I am thinking when the remote write source is other than a Prometheus one, as described in #6256, and datapoints will be mixed in the same request.

eduardscaueru · 2024-10-24T09:06:27Z

@eduardscaueru I see. Thanks for the explanation, #6279 would be better in performance-wise. I'd like to hear other people's opinions.

The cortex watches a runtime config, it is applied to the cortex at the runtime periodically. I thought the test should ensure a tenant can use a accept_mixed_ha_samples using the runtime config when the accept_mixed_ha_samples is set to false globally.

@SungJin1212 I think I get what you mean now. So after the assertions in the e2e test, the runtime config should override the accept_mixed_ha_samples to false, then push again and assert the changed behaviour is picked up as expected. If this is the case, do you have a reference on how can I change the config dynamically, maybe other test, cause I cannot find something right now.

SungJin1212 · 2024-10-24T09:10:07Z

@eduardscaueru
It is sufficient to test that accept_mixed_ha_samples can be overridden.
It would be better to add experimental flags.

eduardscaueru · 2024-10-24T09:14:01Z

@eduardscaueru It is sufficient to test that accept_mixed_ha_samples can be overridden.

@SungJin1212 hmmm... then a change to the runtime_config_test.go test should work?

SungJin1212 · 2024-10-24T09:25:12Z

@eduardscaueru
I think it would be good to append the runtime config setting in your test.

SungJin1212 · 2024-10-24T10:00:45Z

pkg/distributor/distributor.go

+				if err != nil {
+					// discard sample
+					d.dedupedSamples.WithLabelValues(userID, cluster).Add(float64(len(ts.Samples) + len(ts.Histograms)))
+					continue


errors should be handled differently (CAS error, TooManyReplicaGroupsError)

Do you want to set the status 404 Bad request for the response if there is a timeseries with TooManyReplicaGroupsError?

eduardscaueru · 2024-10-24T17:20:37Z

@SungJin1212 Added the experimental flag, but I still could not find how to set the runtime config. I only saw a runtime config for store bucket and I tried to integrate it as in one of query frontend tests but without success.

Could please point me to a file or test that does it already? Or an object that I should create?

SungJin1212 · 2024-10-25T01:43:17Z

@eduardscaueru
The accept_mixed_ha_samples is already registered in the runtime config as you added accept_mixed_ha_samples to limits.

eduardscaueru · 2024-10-25T10:51:49Z

@SungJin1212 Thank you for your help. Just wanted to point out that I refactored #6279 to not change any function param and treat errors differently.

friedrichg

Sorry for the wait. Looks good to me.

can you add the feature to

cortex/docs/configuration/v1-guarantees.md

Line 4 in 7c973ee

weight: 6

and rebase?

integration/distributor_mixed_ha_samples_test.go

yeya24

The implementation makes sense to me. Thanks!

pkg/util/validation/limits.go

yeya24 · 2024-11-17T19:01:36Z

Still need to fix doc

eduardscaueru · 2024-11-17T20:32:24Z

Still need to fix doc

Sorry, I forgot to add the [Experimental] in limits. Now I hope it is fine, but I don't know why the integration query fuzz test failed.

yeya24 · 2024-11-17T20:44:22Z

but I don't know why the integration query fuzz test failed.

This is unrelated. Have you rebased latest master? This might help it go through

eduardscaueru · 2024-11-17T20:53:49Z

but I don't know why the integration query fuzz test failed.

This is unrelated. Have you rebased latest master? This might help it go through

Nope. Let me do that.

eduardscaueru · 2024-11-17T21:05:34Z

but I don't know why the integration query fuzz test failed.

This is unrelated. Have you rebased latest master? This might help it go through

@yeya24 Updated to the latest changes. Should I squash the commits?

…A pairs (cluster, replica) in the same requets/batch. This can be enabled with a new flag, accept_mixed_ha_samples, an will take effect only if accept_ha_samples is set to true. Fixed test by reducing the number of ingesters to 2 and replication factor to 2. Added config reference. Do not remove replica label if cluster label is not present. Added more HA mixed replicas tests with no cluster and replica labels and with cluster label only. Added e2e test for mixed HA samples in the same request. Refactored distributor mixed HA samples logic. Added experimental flag for accept_mixed_ha_samples. Handled ReplicasNotMatchError TooManyReplicaGroupsError differently. Signed-off-by: eduardscaueru <edi_scaueru@yahoo.com>

yeya24 · 2024-11-17T23:51:31Z

@eduardscaueru It is better to keep separate commits for easier review and Github allows us to squash all commits when merging the PR.

eduardscaueru · 2024-11-18T00:14:45Z

@eduardscaueru It is better to keep separate commits for easier review and Github allows us to squash all commits when merging the PR.

Ack. I assumed the PR is ready to merge and I squashed them, sorry.

pull-request-size bot added the size/L label Oct 20, 2024

dosubot bot added the component/distributor label Oct 20, 2024

eduardscaueru mentioned this pull request Oct 20, 2024

[Feature Request] HA Tracker support for multiple prometheus replicas in the same batch #6256

Closed

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch 7 times, most recently from 48783b3 to dd9bf96 Compare October 21, 2024 15:23

friedrichg self-requested a review October 22, 2024 10:19

SungJin1212 reviewed Oct 22, 2024

View reviewed changes

pkg/distributor/distributor.go Outdated Show resolved Hide resolved

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch from dd9bf96 to e945eb4 Compare October 23, 2024 19:18

SungJin1212 reviewed Oct 24, 2024

View reviewed changes

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch from e945eb4 to 4498e49 Compare October 24, 2024 17:15

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch from 4498e49 to acfddf6 Compare October 24, 2024 17:22

friedrichg reviewed Nov 15, 2024

View reviewed changes

friedrichg requested review from CharlieTLe, yeya24 and alanprot November 15, 2024 20:15

friedrichg reviewed Nov 15, 2024

View reviewed changes

integration/distributor_mixed_ha_samples_test.go Outdated Show resolved Hide resolved

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch 2 times, most recently from 982b224 to d06bd61 Compare November 16, 2024 16:30

yeya24 approved these changes Nov 17, 2024

View reviewed changes

pkg/util/validation/limits.go Outdated Show resolved Hide resolved

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch 2 times, most recently from 9402ea7 to 1ea3bae Compare November 17, 2024 20:30

eduardscaueru force-pushed the ha_tracker_mixed_replicas branch from 91e94ec to 8d37ae9 Compare November 17, 2024 21:35

friedrichg approved these changes Nov 18, 2024

View reviewed changes

dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 18, 2024

friedrichg merged commit f875d91 into cortexproject:master Nov 19, 2024
16 checks passed

friedrichg mentioned this pull request Nov 19, 2024

Distributor accept multiple HA Tracker pairs in the same request (find first) #6279

Closed

3 tasks

SungJin1212 mentioned this pull request Nov 22, 2024

Add docs for accept multiple ha pairs #6364

Merged

3 tasks

BrewTestBot mentioned this pull request Feb 27, 2025

cortex 1.19.0 Homebrew/homebrew-core#209140

Merged

Distributor accept multiple HA Tracker pairs in the same request #6278

Distributor accept multiple HA Tracker pairs in the same request #6278

Uh oh!

Conversation

eduardscaueru commented Oct 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SungJin1212 commented Oct 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

eduardscaueru commented Oct 23, 2024

Uh oh!

SungJin1212 commented Oct 24, 2024

Uh oh!

eduardscaueru commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SungJin1212 commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SungJin1212 commented Oct 24, 2024

Uh oh!

eduardscaueru commented Oct 24, 2024

Uh oh!

eduardscaueru commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SungJin1212 commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eduardscaueru commented Oct 24, 2024

Uh oh!

SungJin1212 commented Oct 24, 2024

Uh oh!

SungJin1212 Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eduardscaueru Oct 24, 2024

Choose a reason for hiding this comment

Uh oh!

eduardscaueru commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SungJin1212 commented Oct 25, 2024

Uh oh!

eduardscaueru commented Oct 25, 2024

Uh oh!

friedrichg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yeya24 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yeya24 commented Nov 17, 2024

Uh oh!

eduardscaueru commented Nov 17, 2024

Uh oh!

yeya24 commented Nov 17, 2024

Uh oh!

eduardscaueru commented Nov 17, 2024

Uh oh!

eduardscaueru commented Nov 17, 2024

Uh oh!

yeya24 commented Nov 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eduardscaueru commented Nov 18, 2024

Uh oh!

Uh oh!

Uh oh!

eduardscaueru commented Oct 20, 2024 •

edited

Loading

SungJin1212 commented Oct 22, 2024 •

edited

Loading

eduardscaueru commented Oct 24, 2024 •

edited

Loading

SungJin1212 commented Oct 24, 2024 •

edited

Loading

eduardscaueru commented Oct 24, 2024 •

edited

Loading

SungJin1212 commented Oct 24, 2024 •

edited

Loading

SungJin1212 Oct 24, 2024 •

edited

Loading

eduardscaueru commented Oct 24, 2024 •

edited

Loading

yeya24 commented Nov 17, 2024 •

edited

Loading