feat(server): Delay serving global config in managed mode before upstream response #2697
Conversation
Overall the approach looks OK to me so far. It would also be great to actually add the integration test, to check the exact behaviour we expect from the service: we should not process anything until the global config is available and has been fetched by Relay.
     .and_then(|key| self.projects.get(&key))
-    .and_then(|p| p.valid_state());
+    .and_then(|p| p.valid_state())
+    .filter(|state| state.organization_id == own_project_state.organization_id);
Why was this changed?
I think the code is much clearer this way. It's not related to this PR; I'm just following the principle of cleaning up things in the module I'm working in as I go along.
.global_config()
.send(global_config::Get)
.await?
.get_ready_or_default(),
So here we always return the default if the config is currently unavailable?
Yes, not because I think it's good, but because it's what we currently do, and I wasn't sure about adding another behaviour in the scope of this PR.
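For context, a minimal sketch of how a getter like get_ready_or_default can behave, assuming a simple status enum; the names and types here are illustrative, not Relay's actual implementation:

use std::sync::Arc;

#[derive(Default)]
struct GlobalConfig; // placeholder for the real global config type

enum Status {
    Ready(Arc<GlobalConfig>),
    Pending,
}

impl Status {
    fn get_ready_or_default(&self) -> Arc<GlobalConfig> {
        match self {
            // Upstream already responded: serve the real config.
            Status::Ready(config) => Arc::clone(config),
            // No upstream response yet: fall back to a default config,
            // which is exactly the behaviour questioned in this thread.
            Status::Pending => Arc::new(GlobalConfig::default()),
        }
    }
}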
Does this mean that PoP Relays can get a global config and process envelopes when they should not? A processing Relay that hasn't fetched a global config from Sentry will send the default config to PoP Relays, and the PoPs will start processing. The same applies to any further Relays down the chain.
For others: Iker and I synced and concluded that it's not a problem in practice, because we deploy processing before PoPs with a significant time delay, and by then processing should have a global config ready.
Please, others, share your feedback on this.
The time delay does not matter if the processing Relays cannot get the global config from upstream. If the PoPs are deployed while there is an incident with fetching the global config, they will get the default one.
You're right, it's not ideal. I discussed this with @Dav1dde, who also pointed this out. We would have to communicate to the downstream Relays that it's a default global config they're receiving, or not send one back at all. Backwards compatibility is an issue here.
I think this still mirrors the current behaviour; we can tackle it in a separate issue IMO. The possible solutions are actually quite tricky and depend on whether we want to change the API or not.
if let GlobalConfigStatus::Pending(keys) = &mut self.global_config {
    keys.insert(partial_key);
    return;
}
As I mentioned before, this can potentially be harmful: we can accumulate a lot of data if we can never get the proper global config from the upstream.
I had that in mind when making the recent change, as a set of project keys should take a lot less space than a vector of messages.
I don't think memory should be a problem here, though; we already have many growing vectors in the ProjectCacheBroker. Although maybe we can use the indexed keys to fetch new states when the global config arrives.
We could sync on this in case we end up going back and forth a lot here.
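A tiny sketch of the memory argument above, assuming the pending state tracks one deduplicated key per project rather than whole queued messages (types are simplified placeholders, not Relay's definitions):

use std::collections::BTreeSet;

type ProjectKey = [u8; 16]; // a key is small and fixed-size...
struct QueuedMessage { payload: Vec<u8> } // ...a buffered message is not

struct PendingGlobalConfig {
    // A set deduplicates: N envelopes for the same project cost one
    // key entry here, instead of N queued messages in a Vec.
    keys: BTreeSet<ProjectKey>,
}

impl PendingGlobalConfig {
    fn record(&mut self, key: ProjectKey) {
        // No-op if this project was already recorded.
        self.keys.insert(key);
    }
}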
I definitely agree we need an integration test for this. I thought I'd wait until later, but I can start right away, since the purpose of the integration test is the behaviour, not the implementation.
     event = mini_sentry.captured_events.get(timeout=12).get_event()
     assert event["logentry"] == {"formatted": "Hello, World!"}
-    assert retry_count == 2
+    assert retry_count == 4
Why do project configs need more retries with this PR?
tests/integration/test_envelope.py (outdated)
# Global configs are fetched in 10 second intervals, so the event should have come
# through after a 10 sec timeout.
events_consumer.get_event(timeout=10)
Please assert that after the global config is fetched, we get all sent envelopes here.
Do you mean to do it in a more explicit manner? Because AFAICT, calling get_event does assert that we got the envelope, since it will panic if one is not present.
Looks fine to me.
You mentioned you might want to wait on @jan-auer for a final review. Otherwise I would carefully test it in prod.
One important issue still to handle is upstream Relays, to make sure Relays fetching from them receive a correct global config. It can be done in an immediate follow-up PR. But to be honest, I would make sure it's finished in this PR, so the feature is fully complete.
Code LGTM, one open question before approving.
Do you mean the issue of them potentially getting a default global config and thinking it's from Sentry? How to handle that case isn't obvious, so we should probably discuss it together properly before doing anything here.
# Check that we received exactly {envelope_qty} envelopes.
for _ in range(envelope_qty):
    events_consumer.get_event(timeout=2)
events_consumer.assert_empty()
This one doesn't require sleeping first, since we assert that it's not empty in the previous lines; we just want to check that we take out exactly envelope_qty envelopes.
Good to test.
The proper propagation of the global config to PoPs can be handled in the follow-ups.
@@ -597,7 +597,7 @@ def get_project_config():
     res = original_endpoint().get_json()
     if not include_global:
-        res.pop("global")
+        res["global"] = None
What's the reason for this change? This means global will be a key with a None value. The previous implementation, and Sentry's behavior, is not to include global at all. As a result, this test emulates a different behavior from what Sentry does.
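As an aside on whether that difference is observable: with serde-style deserialization into an Option, the two payload shapes parse identically, even though the wire format now differs from Sentry's. A minimal sketch under that assumption (ConfigResponse is an illustrative name, not Relay's actual response type):

use serde::Deserialize;

#[derive(Deserialize, Debug, Default)]
struct GlobalConfig {}

#[derive(Deserialize, Debug)]
struct ConfigResponse {
    // serde's derive treats Option fields as optional: both a missing
    // "global" key and an explicit "global": null deserialize to None.
    global: Option<GlobalConfig>,
}

fn main() {
    let with_null: ConfigResponse =
        serde_json::from_str(r#"{"global": null}"#).unwrap();
    let missing: ConfigResponse = serde_json::from_str("{}").unwrap();
    // Both print None; the wire payloads differ, the parsed values do not.
    println!("{:?} / {:?}", with_null.global, missing.global);
}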
Follow-up to #2697. We now have a mechanism to delay envelope processing until a global config is available. However, we currently send back a default global config if one is requested from downstream, which somewhat defeats the purpose: the downstream Relay believes it has a valid global config when it doesn't. To fix this, this PR returns an additional flag indicating whether the config it sends is ready or not. Up-to-date downstream Relays will not use an unready config, but will keep trying to get a ready global config and will not process envelopes until they have one. Older Relays won't deserialize the status flag, and their behaviour will be as before.
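A minimal sketch of how such a backwards-compatible flag could look on the wire, assuming serde-style serialization; the field and type names below are illustrative, not the actual endpoint schema:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Default)]
struct GlobalConfig {}

#[derive(Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum GlobalStatus {
    Ready,
    Pending,
}

#[derive(Serialize, Deserialize)]
struct GlobalConfigResponse {
    global: GlobalConfig,
    // Older Relays ignore unknown fields, so adding this flag is
    // backwards compatible and they keep the old behaviour. Up-to-date
    // Relays treat anything but Ready as "keep polling, don't process".
    #[serde(default, skip_serializing_if = "Option::is_none")]
    global_status: Option<GlobalStatus>,
}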
Problem
Currently, before we have requested the global config from upstream, we simply serve the default global config. This kind of implicit behaviour is not ideal.
Solution
This PR changes the behaviour in managed mode: we won't start processing envelopes until we have received a valid global config. To do this, we hook into the existing buffering logic in the project cache, where we already buffer until we receive a project state for an envelope, and simply add the global config as a second requirement.
Implementation
In handle_validate_envelope we check whether the global config is available; if not, we buffer the envelope. The reason we don't dequeue directly when we receive the global config is that the project states might have expired by the time the global config arrives. This might lead to some redundant fetching of project states, but I figured it would be rather insignificant, and it will only happen once; the added complexity of checking was not worth it IMO. If you disagree, let me know!
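A condensed sketch of the buffering and dequeue logic described above, under simplified placeholder types (Envelope, the method names, and the broker shape are illustrative, not Relay's exact code):

use std::collections::BTreeSet;

type ProjectKey = String;
struct Envelope { project_key: ProjectKey }

enum GlobalConfigStatus {
    Ready,
    Pending(BTreeSet<ProjectKey>),
}

struct ProjectCacheBroker {
    global_config: GlobalConfigStatus,
}

impl ProjectCacheBroker {
    fn handle_validate_envelope(&mut self, envelope: Envelope) {
        // Processing requires BOTH a ready global config and a valid
        // project state; otherwise the envelope goes to the buffer.
        if let GlobalConfigStatus::Pending(keys) = &mut self.global_config {
            keys.insert(envelope.project_key.clone());
            self.buffer(envelope);
            return;
        }
        if !self.has_valid_project_state(&envelope) {
            self.buffer(envelope);
            return;
        }
        self.process(envelope);
    }

    fn handle_global_config_ready(&mut self) {
        // Rather than dequeuing directly, re-request project states for
        // the recorded keys: they may have expired while the global
        // config was pending, as noted above.
        let previous =
            std::mem::replace(&mut self.global_config, GlobalConfigStatus::Ready);
        if let GlobalConfigStatus::Pending(keys) = previous {
            for key in keys {
                self.request_project_state(key);
            }
        }
    }

    // Stubs standing in for the real cache operations.
    fn buffer(&mut self, _envelope: Envelope) {}
    fn process(&mut self, _envelope: Envelope) {}
    fn has_valid_project_state(&self, _envelope: &Envelope) -> bool { true }
    fn request_project_state(&mut self, _key: ProjectKey) {}
}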