Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributor accept multiple HA Tracker pairs in the same request #6278

Merged

Conversation

eduardscaueru
Copy link
Contributor

@eduardscaueru eduardscaueru commented Oct 20, 2024

What this PR does:
Added new implementation that makes the distributor accept multiple HA pairs (cluster, replica) in the same requets/batch. This can be enabled with a new flag, accept_mixed_ha_samples, an will take effect only if accept_ha_samples is set to true.

This implementation check every timeseries from the request if it has both cluster and replica labels. If yes, then it checks the KV store to see if it matches with the elected replica. If not, the current timeseries is discarded (not added to the validatedTimeseries) and the rest of the batch moves on.

It also ensures that when the cluster label is missing the replica label is not removed.

Which issue(s) this PR fixes:
Fixes #6256

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@SungJin1212
Copy link
Contributor

SungJin1212 commented Oct 22, 2024

@eduardscaueru
Thank you for your contribution.

For me, #6278 looks more good. I feel like it's more readable and it didn't change the function parameters.

Can you add an e2e test to make sure accept_mixed_ha_samples correctly works as a runtime config?

@eduardscaueru
Copy link
Contributor Author

@SungJin1212 @friedrichg I added a e2e test for this feature, let me know if it is ok.

Regarding the other approach, PR 6279, I feel like it is more efficient in the scenario where multiple timeseries with the same HA pairs come in the same request. Mostly, if this happens, then the client cannot aggregate them and push all samples for each pair, and will have only one sample/datapoint per timeseries, therefore making a call to the KV store for all of them.

The precomputed map of valid HA pairs it is efficient since if the pair is already in the map, it won't make another call to the KV store and to check if the timeseries is valid just checks the map in O(1).

I know I said in the in the issue that this PR is more suitable, but I would incline more for PR 6279 given the scenario from above.

@SungJin1212
Copy link
Contributor

@eduardscaueru
The number of calls to checkSample seems to be the same, doesn't it? I could be wrong. The e2e test doesn't show that accept_mixed_ha_samples works as a runtime config.

@eduardscaueru
Copy link
Contributor Author

eduardscaueru commented Oct 24, 2024

@eduardscaueru The number of calls to checkSample seems to be the same, doesn't it? I could be wrong. The e2e test doesn't show that accept_mixed_ha_samples works as a runtime config.

@SungJin1212 The checkSample method will be called only one time for a unique pair. These unique pairs are computed with findAllHAPairs.

Sorry, I thought that runtime config would be by enabling the CLI flag for it. Could you please guide me on what needs to be done to enable it?

@SungJin1212
Copy link
Contributor

SungJin1212 commented Oct 24, 2024

@eduardscaueru
I see. Thanks for the explanation, #6279 would be better in performance-wise. I'd like to hear other people's opinions.

The cortex watches a runtime config, it is applied to the cortex at the runtime periodically.
I thought the test should ensure a tenant can use a accept_mixed_ha_samples using the runtime config when the accept_mixed_ha_samples is set to false globally.

@SungJin1212
Copy link
Contributor

@eduardscaueru
I'm just wondering about a use case for this feature.

@eduardscaueru
Copy link
Contributor Author

@eduardscaueru I'm just wondering about a use case for this feature.

I am thinking when the remote write source is other than a Prometheus one, as described in #6256, and datapoints will be mixed in the same request.

@eduardscaueru
Copy link
Contributor Author

eduardscaueru commented Oct 24, 2024

@eduardscaueru I see. Thanks for the explanation, #6279 would be better in performance-wise. I'd like to hear other people's opinions.

The cortex watches a runtime config, it is applied to the cortex at the runtime periodically. I thought the test should ensure a tenant can use a accept_mixed_ha_samples using the runtime config when the accept_mixed_ha_samples is set to false globally.

@SungJin1212 I think I get what you mean now. So after the assertions in the e2e test, the runtime config should override the accept_mixed_ha_samples to false, then push again and assert the changed behaviour is picked up as expected. If this is the case, do you have a reference on how can I change the config dynamically, maybe other test, cause I cannot find something right now.

@SungJin1212
Copy link
Contributor

SungJin1212 commented Oct 24, 2024

@eduardscaueru
It is sufficient to test that accept_mixed_ha_samples can be overridden.
It would be better to add experimental flags.

@eduardscaueru
Copy link
Contributor Author

@eduardscaueru It is sufficient to test that accept_mixed_ha_samples can be overridden.

@SungJin1212 hmmm... then a change to the runtime_config_test.go test should work?

@SungJin1212
Copy link
Contributor

@eduardscaueru
I think it would be good to append the runtime config setting in your test.

Comment on lines 866 to 889
if err != nil {
// discard sample
d.dedupedSamples.WithLabelValues(userID, cluster).Add(float64(len(ts.Samples) + len(ts.Histograms)))
continue
Copy link
Contributor

@SungJin1212 SungJin1212 Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors should be handled differently (CAS error, TooManyReplicaGroupsError)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to set the status 404 Bad request for the response if there is a timeseries with TooManyReplicaGroupsError?

@eduardscaueru
Copy link
Contributor Author

eduardscaueru commented Oct 24, 2024

@SungJin1212 Added the experimental flag, but I still could not find how to set the runtime config. I only saw a runtime config for store bucket and I tried to integrate it as in one of query frontend tests but without success.

Could please point me to a file or test that does it already? Or an object that I should create?

@SungJin1212
Copy link
Contributor

@eduardscaueru
The accept_mixed_ha_samples is already registered in the runtime config as you added accept_mixed_ha_samples to limits.

@eduardscaueru
Copy link
Contributor Author

@SungJin1212 Thank you for your help. Just wanted to point out that I refactored #6279 to not change any function param and treat errors differently.

Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the wait. Looks good to me.

can you add the feature to

and rebase?

@eduardscaueru eduardscaueru force-pushed the ha_tracker_mixed_replicas branch 2 times, most recently from 982b224 to d06bd61 Compare November 16, 2024 16:30
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation makes sense to me. Thanks!

pkg/util/validation/limits.go Outdated Show resolved Hide resolved
@yeya24
Copy link
Contributor

yeya24 commented Nov 17, 2024

Still need to fix doc

@eduardscaueru eduardscaueru force-pushed the ha_tracker_mixed_replicas branch 2 times, most recently from 9402ea7 to 1ea3bae Compare November 17, 2024 20:30
@eduardscaueru
Copy link
Contributor Author

Still need to fix doc

Sorry, I forgot to add the [Experimental] in limits. Now I hope it is fine, but I don't know why the integration query fuzz test failed.

@yeya24
Copy link
Contributor

yeya24 commented Nov 17, 2024

but I don't know why the integration query fuzz test failed.

This is unrelated. Have you rebased latest master? This might help it go through

@eduardscaueru
Copy link
Contributor Author

but I don't know why the integration query fuzz test failed.

This is unrelated. Have you rebased latest master? This might help it go through

Nope. Let me do that.

@eduardscaueru
Copy link
Contributor Author

but I don't know why the integration query fuzz test failed.

This is unrelated. Have you rebased latest master? This might help it go through

@yeya24 Updated to the latest changes. Should I squash the commits?

…A pairs (cluster, replica) in the same requets/batch. This can be enabled with a new flag, accept_mixed_ha_samples, an will take effect only if accept_ha_samples is set to true.

Fixed test by reducing the number of ingesters to 2 and replication factor to 2.

Added config reference.

Do not remove replica label if cluster label is not present.

Added more HA mixed replicas tests with no cluster and replica labels and with cluster label only.

Added e2e test for mixed HA samples in the same request. Refactored distributor mixed HA samples logic.

Added experimental flag for accept_mixed_ha_samples.

Handled ReplicasNotMatchError TooManyReplicaGroupsError differently.

Signed-off-by: eduardscaueru <edi_scaueru@yahoo.com>
@yeya24
Copy link
Contributor

yeya24 commented Nov 17, 2024

@eduardscaueru It is better to keep separate commits for easier review and Github allows us to squash all commits when merging the PR.

@eduardscaueru
Copy link
Contributor Author

@eduardscaueru It is better to keep separate commits for easier review and Github allows us to squash all commits when merging the PR.

Ack. I assumed the PR is ready to merge and I squashed them, sorry.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 18, 2024
@friedrichg friedrichg merged commit f875d91 into cortexproject:master Nov 19, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/distributor lgtm This PR has been approved by a maintainer size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] HA Tracker support for multiple prometheus replicas in the same batch
4 participants