feat(server): Delay serving global config in managed mode before upstream response #2697
Conversation
Overall the approach looks OK to me so far. It would also be great to actually add the integration test, to check the exact behaviour we expect from the service: we should not process anything until the global config is available and has been fetched by Relay.
     .and_then(|key| self.projects.get(&key))
-    .and_then(|p| p.valid_state());
+    .and_then(|p| p.valid_state())
+    .filter(|state| state.organization_id == own_project_state.organization_id);
Why was this changed?
I think the code is much clearer this way. It's not related to this PR; I'm just following the principle of cleaning up things in the module I'm working in as I go along.
.global_config()
.send(global_config::Get)
.await?
.get_ready_or_default(),
So here we always return the default if the config is currently unavailable?
Yes, not because I think it's good, but because it's what we currently do, and I wasn't sure about adding another behaviour in the scope of this PR.
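For context, a minimal sketch of how a getter like get_ready_or_default can behave, assuming a simple status enum; the names and types here are illustrative, not Relay's actual implementation:

use std::sync::Arc;

#[derive(Default)]
struct GlobalConfig; // placeholder for the real global config type

enum Status {
    Ready(Arc<GlobalConfig>),
    Pending,
}

impl Status {
    fn get_ready_or_default(&self) -> Arc<GlobalConfig> {
        match self {
            // Upstream already responded: serve the real config.
            Status::Ready(config) => Arc::clone(config),
            // No upstream response yet: fall back to a default config,
            // which is exactly the behaviour questioned in this thread.
            Status::Pending => Arc::new(GlobalConfig::default()),
        }
    }
}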
Does this mean that PoP Relays can get a global config and process envelopes when they should not? A processing Relay that hasn't fetched a global config from Sentry will send the default config to PoP Relays, and the PoPs will start processing. The same applies to any further Relays down the chain.
For others: Iker and I synced and concluded that it's not a problem in practice, because we deploy processing before PoPs with a significant time delay, and by then processing should have a global config ready.
Please, others, share your feedback on this.
The time delay does not matter if the processing Relays cannot get the global config from upstream. If the PoPs are deployed while there is an incident with fetching the global config, they will get the default one.
You're right, it's not ideal. I discussed this with @Dav1dde, who also pointed this out. We would have to communicate to the downstream Relays that it's a default global config they're receiving, or not send one back at all. Backwards compatibility is an issue here.
I think this still mirrors the current behaviour; we can tackle it in a separate issue IMO. The possible solutions are actually quite tricky and depend on whether we want to change the API or not.
if let GlobalConfigStatus::Pending(keys) = &mut self.global_config {
    keys.insert(partial_key);
    return;
}
As I mentioned before, this can potentially be harmful: we can accumulate a lot of data if we can never get the proper global config from the upstream.
I had that in mind when making the recent change, as a set of project keys should take a lot less space than a vector of messages.
I don't think memory should be a problem here, though; we already have many growing vectors in the ProjectCacheBroker. Although maybe we can use the indexed keys to fetch new states when the global config arrives.
We could sync on this in case we end up going back and forth a lot here.
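A tiny sketch of the memory argument above, assuming the pending state tracks one deduplicated key per project rather than whole queued messages (types are simplified placeholders, not Relay's definitions):

use std::collections::BTreeSet;

type ProjectKey = [u8; 16]; // a key is small and fixed-size...
struct QueuedMessage { payload: Vec<u8> } // ...a buffered message is not

struct PendingGlobalConfig {
    // A set deduplicates: N envelopes for the same project cost one
    // key entry here, instead of N queued messages in a Vec.
    keys: BTreeSet<ProjectKey>,
}

impl PendingGlobalConfig {
    fn record(&mut self, key: ProjectKey) {
        // No-op if this project was already recorded.
        self.keys.insert(key);
    }
}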
I definitely agree we need an integration test for this. I thought I'd wait until later, but I can start right away, since the purpose of the integration test is the behaviour, not the implementation.
     event = mini_sentry.captured_events.get(timeout=12).get_event()
     assert event["logentry"] == {"formatted": "Hello, World!"}
-    assert retry_count == 2
+    assert retry_count == 4
Why do project configs need more retries with this PR?
tests/integration/test_envelope.py (outdated)
# Global configs are fetched in 10 second intervals, so the event should have come
# through after a 10 sec timeout.
events_consumer.get_event(timeout=10)
Please assert that after the global config is fetched, we get all sent envelopes here.
Do you mean to do it in a more explicit manner? Because AFAICT, calling get_event does assert that we got the envelope, since it will panic if one is not present.
Looks fine to me.
You mentioned you might want to wait on @jan-auer for a final review. Otherwise I would carefully test it in prod.
One important issue still to handle is upstream Relays, to make sure Relays fetching from them receive a correct global config. It can be done in an immediate follow-up PR. But to be honest, I would make sure it's finished in this PR, so the feature is fully complete.
Code LGTM, one open question before approving.
Do you mean the issue of them potentially getting a default global config and thinking it's from Sentry? How to handle that case isn't obvious, so we should probably discuss it together properly before doing anything here.
# Check that we received exactly {envelope_qty} envelopes.
for _ in range(envelope_qty):
    events_consumer.get_event(timeout=2)
events_consumer.assert_empty()
This one doesn't require sleeping first, since we assert that it's not empty in the previous lines; we just want to check that we take out exactly envelope_qty envelopes.
Good to test.
The proper propagation of the global config to PoPs can be handled in the follow-ups.
@@ -597,7 +597,7 @@ def get_project_config():
     res = original_endpoint().get_json()
     if not include_global:
-        res.pop("global")
+        res["global"] = None
What's the reason for this change? This means global will be a key with a None value. The previous implementation, and Sentry's behavior, is not to include global at all. As a result, this test emulates a different behavior from what Sentry does.
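As an aside on whether that difference is observable: with serde-style deserialization into an Option, the two payload shapes parse identically, even though the wire format now differs from Sentry's. A minimal sketch under that assumption (ConfigResponse is an illustrative name, not Relay's actual response type):

use serde::Deserialize;

#[derive(Deserialize, Debug, Default)]
struct GlobalConfig {}

#[derive(Deserialize, Debug)]
struct ConfigResponse {
    // serde's derive treats Option fields as optional: both a missing
    // "global" key and an explicit "global": null deserialize to None.
    global: Option<GlobalConfig>,
}

fn main() {
    let with_null: ConfigResponse =
        serde_json::from_str(r#"{"global": null}"#).unwrap();
    let missing: ConfigResponse = serde_json::from_str("{}").unwrap();
    // Both print None; the wire payloads differ, the parsed values do not.
    println!("{:?} / {:?}", with_null.global, missing.global);
}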
Follow-up to #2697. We now have a mechanism to delay envelope processing until a global config is available. However, we currently send back a default global config if one is requested from downstream, which somewhat defeats the purpose: the downstream Relay believes it has a valid global config when it doesn't. To fix this, this PR returns an additional flag indicating whether the config it sends is ready or not. Up-to-date downstream Relays will not use an unready config, but will keep trying to get a ready global config and will not process envelopes until they have one. Older Relays won't deserialize the status flag, and their behaviour will be as before.
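A minimal sketch of how such a backwards-compatible flag could look on the wire, assuming serde-style serialization; the field and type names below are illustrative, not the actual endpoint schema:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Default)]
struct GlobalConfig {}

#[derive(Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum GlobalStatus {
    Ready,
    Pending,
}

#[derive(Serialize, Deserialize)]
struct GlobalConfigResponse {
    global: GlobalConfig,
    // Older Relays ignore unknown fields, so adding this flag is
    // backwards compatible and they keep the old behaviour. Up-to-date
    // Relays treat anything but Ready as "keep polling, don't process".
    #[serde(default, skip_serializing_if = "Option::is_none")]
    global_status: Option<GlobalStatus>,
}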
Problem
Currently, before we have requested the global config from upstream, we simply serve the default global config. This kind of implicit behaviour is not ideal.
Solution
This PR changes the behaviour in managed mode: we won't start processing envelopes until we have received a valid global config. To do this, we hook into the existing buffering logic in the project cache, where we already buffer until we receive a project state for an envelope, and simply add the global config as a second requirement.
Implementation
In handle_validate_envelope we check whether the global config is available; if not, we buffer the envelope. The reason we don't dequeue directly when we receive the global config is that the project states might have expired by the time the global config arrives. This might lead to some redundant fetching of project states, but I figured it would be rather insignificant, and it will only happen once; the added complexity of checking was not worth it IMO. If you disagree, let me know!
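A condensed sketch of the buffering and dequeue logic described above, under simplified placeholder types (Envelope, the method names, and the broker shape are illustrative, not Relay's exact code):

use std::collections::BTreeSet;

type ProjectKey = String;
struct Envelope { project_key: ProjectKey }

enum GlobalConfigStatus {
    Ready,
    Pending(BTreeSet<ProjectKey>),
}

struct ProjectCacheBroker {
    global_config: GlobalConfigStatus,
}

impl ProjectCacheBroker {
    fn handle_validate_envelope(&mut self, envelope: Envelope) {
        // Processing requires BOTH a ready global config and a valid
        // project state; otherwise the envelope goes to the buffer.
        if let GlobalConfigStatus::Pending(keys) = &mut self.global_config {
            keys.insert(envelope.project_key.clone());
            self.buffer(envelope);
            return;
        }
        if !self.has_valid_project_state(&envelope) {
            self.buffer(envelope);
            return;
        }
        self.process(envelope);
    }

    fn handle_global_config_ready(&mut self) {
        // Rather than dequeuing directly, re-request project states for
        // the recorded keys: they may have expired while the global
        // config was pending, as noted above.
        let previous =
            std::mem::replace(&mut self.global_config, GlobalConfigStatus::Ready);
        if let GlobalConfigStatus::Pending(keys) = previous {
            for key in keys {
                self.request_project_state(key);
            }
        }
    }

    // Stubs standing in for the real cache operations.
    fn buffer(&mut self, _envelope: Envelope) {}
    fn process(&mut self, _envelope: Envelope) {}
    fn has_valid_project_state(&self, _envelope: &Envelope) -> bool { true }
    fn request_project_state(&mut self, _key: ProjectKey) {}
}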