
feat(server): Delay serving global config in managed mode before upstream response #2697

Merged
43 commits merged into master on Nov 21, 2023

Conversation

Contributor

@TBS1996 commented Nov 8, 2023

Problem

Currently, before we have requested the global config from upstream, we simply serve the default global config. This kind of implicit behaviour is not ideal.

Solution

This PR changes the behaviour in managed mode: we won't start processing envelopes until we have received a valid global config. To do this, we hook into the existing buffering logic in the project cache, where envelopes are buffered until their project state arrives, and simply add the global config as a second requirement.

Implementation

  • We move the logic for fetching the global config from the envelope processor to the project cache, and include the config in the message we send to the envelope processor for processing, which guarantees there is a valid global config at the time of processing.
  • When we receive an envelope, we check in handle_validate_envelope whether we have a global config; if not, we buffer the envelope.
  • Receiving project states used to always trigger a dequeue from the buffer, but now, if we don't have a global config, we skip the dequeue and push the project key that would have been dequeued into a BTreeSet.
  • When we receive a global config, we request new project states for all of the project keys in the BTreeSet, which triggers a new dequeue when their project states arrive; this time the dequeue should succeed, since we have a global config.

The reason we don't dequeue directly when we receive the global config is that the project states might have expired by the time the global config arrives. This might lead to some redundant fetching of project states, but that should be rather insignificant and will only happen once; the added complexity of checking for expiry was not worth it imo. If you disagree, let me know!
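The buffering decision described above can be sketched roughly as follows. This is a simplified model, not the actual relay-server code: the type names follow the snippets quoted later in the review, but `Envelope`, `ProjectKey`, and the broker fields are stand-in assumptions, and the service plumbing is omitted.

```rust
use std::collections::BTreeSet;

// Simplified stand-ins for the real relay-server types.
type ProjectKey = String;

pub struct Envelope {
    pub project_key: ProjectKey,
}

// Until upstream responds we only track which project keys are
// waiting; once `Ready`, envelopes flow through the normal path.
pub enum GlobalConfigStatus {
    Pending(BTreeSet<ProjectKey>),
    Ready,
}

pub struct ProjectCacheBroker {
    pub global_config: GlobalConfigStatus,
    pub buffer: Vec<Envelope>,
}

impl ProjectCacheBroker {
    /// Sketch of the check added to `handle_validate_envelope`:
    /// without a global config, remember the project key and keep
    /// the envelope buffered instead of forwarding it.
    pub fn handle_validate_envelope(&mut self, envelope: Envelope) {
        if let GlobalConfigStatus::Pending(keys) = &mut self.global_config {
            keys.insert(envelope.project_key.clone());
            self.buffer.push(envelope);
            return;
        }
        // Global config is ready: proceed with the usual project
        // state lookup and hand-off to the envelope processor.
    }
}
```

Using a `BTreeSet` means a key buffered by many envelopes is only recorded once, which matters for the memory discussion later in the review.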

@TBS1996 TBS1996 marked this pull request as ready for review November 10, 2023 09:03
@TBS1996 TBS1996 requested a review from a team November 10, 2023 09:03
@TBS1996 TBS1996 self-assigned this Nov 10, 2023
@TBS1996 TBS1996 linked an issue Nov 10, 2023 that may be closed by this pull request
olksdr previously requested changes Nov 10, 2023

Contributor

@olksdr left a comment

Overall the approach looks OK to me so far. It would be great to actually add an integration test to check the exact behaviour we expect of the service, i.e. we should not process anything until the global config is available and has been fetched by Relay.

CHANGELOG.md (outdated, resolved)
relay-server/src/actors/global_config.rs (4 outdated threads, resolved)
     .and_then(|key| self.projects.get(&key))
-    .and_then(|p| p.valid_state());
+    .and_then(|p| p.valid_state())
+    .filter(|state| state.organization_id == own_project_state.organization_id);
Contributor

Why was this changed?

Contributor Author

I think the code is much clearer this way. It's not related to this PR; I'm just following the principle of cleaning up things in the module I'm working in as I go along.

relay-server/src/actors/project_cache.rs (outdated, resolved)
.global_config()
.send(global_config::Get)
.await?
.get_ready_or_default(),
Contributor

So here we still actually return the default if the config is currently unavailable?

Contributor Author

Yes; not because I think it's good, but because it's what we currently do, and I wasn't sure whether to add another behaviour change in the scope of this PR.
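For context, the fallback under discussion amounts to something like the following. This is a simplified stand-in, not the actual `global_config` actor API; the `Status` enum and the `measurements_enabled` field are illustrative assumptions.

```rust
/// Simplified stand-in for the real global config, which carries
/// many more options.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct GlobalConfig {
    pub measurements_enabled: bool,
}

/// Rough model of what a `global_config::Get` response offers.
pub enum Status {
    Ready(GlobalConfig),
    Pending,
}

impl Status {
    /// Sketch of `get_ready_or_default`: while the upstream request
    /// is still in flight, callers get `GlobalConfig::default()`,
    /// which is the implicit fallback behaviour discussed in this
    /// thread.
    pub fn get_ready_or_default(&self) -> GlobalConfig {
        match self {
            Status::Ready(config) => config.clone(),
            Status::Pending => GlobalConfig::default(),
        }
    }
}
```

The processing path in this PR avoids this fallback by only dispatching envelopes once a real config is attached to the processing message; the fallback remains on the path that serves configs to downstream Relays.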

Contributor

Does this mean that PoP relays can get a global config and process envelopes when they should not? A processing relay that hasn't fetched a global config from Sentry will send the default config to PoP relays, and PoPs will start processing. This is also the case for the following relays in the chain.

Contributor Author

For others: Iker and I synced and concluded that it's not a problem in practice, because we deploy processing before PoPs with a significant time delay, and by that time processing should have a global config ready.

Please, others, come with feedback about this.

Contributor

The time delay does not matter if the processing Relays cannot get the global config from upstream. When the PoPs are deployed while there is an incident with fetching the global config, the PoPs will get the default one.

Contributor Author

You're right, it's not ideal. I discussed this with @Dav1dde, who also pointed this out. We would have to communicate to the downstream Relays that it's a default global config they're receiving, or not send one back at all. Backwards compatibility is an issue here.

Member

I think this still mirrors current behaviour; we can tackle this in a separate issue imo. The possible solutions are actually quite tricky and depend on whether we want to change the API or not.

relay-server/src/actors/project_cache.rs (outdated, resolved)
Comment on lines +528 to +531
if let GlobalConfigStatus::Pending(keys) = &mut self.global_config {
keys.insert(partial_key);
return;
}
Contributor

As I mentioned before, this can potentially be harmful: we can accumulate a lot of data in case we can never get a proper global config from the upstream.

Contributor Author

I had that in mind when making the recent change: a set of project keys should take a lot less space than a vector of messages.

I don't think memory should be a problem here, though; we already have many growing vectors in the ProjectCacheBroker. Although maybe we can use the indexed keys to fetch new states when the global config arrives.

We could sync on this in case we end up going back and forth a lot here.
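The other half of the mechanism, draining the pending set once the config arrives, can be sketched like this. Again a simplified model under the same assumptions as above; `handle_global_config_ready` is a hypothetical name for the handler, not the actual method.

```rust
use std::collections::BTreeSet;

type ProjectKey = String;

pub enum GlobalConfigStatus {
    Pending(BTreeSet<ProjectKey>),
    Ready,
}

pub struct ProjectCacheBroker {
    pub global_config: GlobalConfigStatus,
}

impl ProjectCacheBroker {
    /// Hypothetical handler for the moment the upstream delivers a
    /// global config: flip the status to `Ready` and return the
    /// accumulated keys so fresh project states can be requested.
    /// Their arrival then triggers the usual dequeue path.
    pub fn handle_global_config_ready(&mut self) -> Vec<ProjectKey> {
        match std::mem::replace(&mut self.global_config, GlobalConfigStatus::Ready) {
            // Project states may have expired while we waited, so we
            // re-request them rather than dequeuing directly.
            GlobalConfigStatus::Pending(keys) => keys.into_iter().collect(),
            GlobalConfigStatus::Ready => Vec::new(),
        }
    }
}
```

`std::mem::replace` lets the handler take ownership of the pending set while leaving the broker in the `Ready` state in a single move, with no cloning of the key set.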

Contributor Author

@TBS1996 commented Nov 10, 2023

Overall the approach looks OK to me so far. It would be great to actually add an integration test to check the exact behaviour we expect of the service, i.e. we should not process anything until the global config is available and has been fetched by Relay.

I definitely agree we need an integration test for this. I thought I'd wait until later, but I can start right away, since the purpose of the integration test is the behaviour, not the implementation.

@TBS1996 TBS1996 requested a review from jjbayer November 15, 2023 10:26
relay-server/src/actors/global_config.rs (resolved)
relay-server/src/actors/project_cache.rs (outdated, resolved)
Comment on lines +143 to +145
     event = mini_sentry.captured_events.get(timeout=12).get_event()
     assert event["logentry"] == {"formatted": "Hello, World!"}
-    assert retry_count == 2
+    assert retry_count == 4
Contributor

Why do project configs need more retries with this PR?

tests/integration/test_envelope.py (4 resolved threads, 3 outdated)
CHANGELOG.md (outdated, resolved)
relay-server/src/actors/project_cache.rs (outdated, resolved)

tests/integration/test_envelope.py (3 outdated threads, resolved)

# Global configs are fetched in 10 second intervals, so the event should have come
# through after a 10 sec timeout.
events_consumer.get_event(timeout=10)
Contributor

Please, assert that after the global config is fetched we get all sent envelopes here.

Contributor Author

Do you mean to do it in a more explicit manner? Because AFAICT, calling get_event does assert that we got the envelope, since it will panic if the envelope is not present.

@olksdr olksdr dismissed their stale review November 17, 2023 07:32

Non blocking review from this point.

Contributor

@olksdr left a comment

Looks fine to me

You mentioned you might want to wait on @jan-auer to give a final review. Otherwise I would carefully test it in prod.

One important issue here to handle is upstream Relays: making sure they receive a correct global config. It can be done in an immediate follow-up PR, but to be honest, I would make sure it's finished in this PR so the feature is fully complete.

Contributor

@iker-barriocanal left a comment

Code LGTM, one open question before approving

tests/integration/test_envelope.py (outdated, resolved)
Contributor Author

@TBS1996 commented Nov 17, 2023

Looks fine to me

You mentioned you might want to wait on @jan-auer to give a final review. Otherwise I would carefully test it in prod.

One important issue here to handle is upstream Relays: making sure they receive a correct global config. It can be done in an immediate follow-up PR, but to be honest, I would make sure it's finished in this PR so the feature is fully complete.

Do you mean the issue of them potentially getting a default global config and thinking it's from Sentry? How to handle that case isn't obvious, so we should probably discuss this together properly before doing anything here.

# Check that we received exactly {envelope_qty} envelopes.
for _ in range(envelope_qty):
events_consumer.get_event(timeout=2)
events_consumer.assert_empty()
Contributor Author

This one doesn't require sleeping first, since we assert that it's not empty in the previous lines; we just want to check that we take out exactly envelope_qty envelopes.

Contributor

@olksdr left a comment

Good to test.

The proper propagation of the global config to PoPs can be handled in the followups.

@TBS1996 TBS1996 enabled auto-merge (squash) November 21, 2023 09:40
@TBS1996 TBS1996 disabled auto-merge November 21, 2023 09:45
@TBS1996 TBS1996 merged commit 9092af5 into master Nov 21, 2023
20 checks passed
@TBS1996 TBS1996 deleted the tor/gcbuffer branch November 21, 2023 10:21
@@ -597,7 +597,7 @@ def get_project_config():

     res = original_endpoint().get_json()
     if not include_global:
-        res.pop("global")
+        res["global"] = None
Contributor

What's the reason for this change?

This means global will be a key with a None value. The previous implementation, and Sentry's behavior, is not to include global. As a result, this test emulates a different behavior from what Sentry does.

TBS1996 added a commit that referenced this pull request Nov 29, 2023

follow-up to #2697


We now have a mechanism to delay envelope processing until a global config is available. However, we currently send back a default global config if one is requested from downstream, which rather defeats the purpose: the downstream Relay believes it has a valid global config when it doesn't.

To fix this, this PR returns an additional flag indicating whether the config being sent is ready. Up-to-date downstream Relays will not use a config that isn't ready, but will instead keep trying to get a ready global config and not process envelopes until they do. Older Relays won't deserialize the status flag, and their behaviour will be as before.
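The backwards-compatibility story in that follow-up can be modelled roughly as below. This is an illustrative sketch only: the type and field names are assumptions, not the actual wire format, and the real code deserializes JSON rather than constructing structs directly.

```rust
/// Illustrative model of the follow-up's readiness flag.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum GlobalConfigStatus {
    Ready,
    Pending,
}

pub struct GlobalConfigResponse {
    /// Whether a config payload was included at all.
    pub config_present: bool,
    /// `None` models an upstream that predates the flag. Older
    /// downstream Relays simply never read this field, so their
    /// behaviour is unchanged.
    pub status: Option<GlobalConfigStatus>,
}

impl GlobalConfigResponse {
    /// An up-to-date downstream only processes envelopes once the
    /// upstream has explicitly marked the config as ready;
    /// otherwise it keeps polling. Treating a missing flag as "not
    /// ready" is a design assumption here, not a documented rule.
    pub fn usable(&self) -> bool {
        self.config_present && self.status == Some(GlobalConfigStatus::Ready)
    }
}
```

The key property is that the flag is purely additive: an old Relay that ignores unknown fields keeps working as before, while a new Relay gains the ability to distinguish a real config from a placeholder default.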
Successfully merging this pull request may close these issues.

Make global config required for processing
5 participants