Use hash of proxy settings as part of datasource instance cache key #1133

Merged: 7 commits merged into main from njvrzm/use-socks-config-in-datasource-cache-key on Nov 20, 2024

Conversation

@njvrzm (Contributor) commented on Nov 4, 2024:

What this PR does / why we need it:
This makes the instance manager's cache key depend on the proxy configuration. Currently in Grafana Cloud, different Grafana instances have different SOCKS proxy configurations, which makes NeedsUpdate return true every time a request comes from a different instance (which is usually every request). This has caused at least one major issue, with a very large number of datasource instances being created unnecessarily.

With this change, the proxy configuration will be used to create the instance cache key. When comparing the cached instance to the incoming configuration, the proxy configuration will be the same unless it has actually been modified. As long as the datasource configuration doesn't change, there should only be one cached instance per Grafana instance per API server instance, rather than an ever-growing number.
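As a rough illustration of the approach, the sketch below hashes a stand-in proxy configuration into the instance cache key; the type and function names are hypothetical, not the SDK's actual API:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// proxySettings is a stand-in for the secure SOCKS proxy configuration that
// arrives with each request from a Grafana instance.
type proxySettings struct {
	ClientCert string
	ClientKey  string
	RootCA     string
	ProxyAddr  string
	ServerName string
}

// hash folds all proxy fields into one short, stable FNV-1a value.
func (p proxySettings) hash() string {
	h := fnv.New64a()
	for _, s := range []string{p.ClientCert, p.ClientKey, p.RootCA, p.ProxyAddr, p.ServerName} {
		_, _ = h.Write([]byte(s))
	}
	return fmt.Sprintf("%x", h.Sum64())
}

// instanceCacheKey combines the datasource identity with the proxy hash, so
// the cached instance is reused until either the datasource settings or the
// proxy configuration actually change.
func instanceCacheKey(dsUID, updated string, p proxySettings) string {
	return fmt.Sprintf("%s#%s#%s", dsUID, updated, p.hash())
}

func main() {
	p := proxySettings{ClientCert: "cert", ClientKey: "key", RootCA: "ca", ProxyAddr: "localhost:9090"}
	fmt.Println(instanceCacheKey("abc123", "2024-11-04T14:15:00Z", p))
}
```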

Which issue(s) this PR fixes:
https://github.com/grafana/app-platform-wg/issues/174

Fixes #

Special notes for your reviewer:

@njvrzm self-assigned this Nov 4, 2024
@njvrzm requested a review from a team as a code owner on November 4, 2024 14:15
@njvrzm requested reviews from wbrowne, andresmgot, xnyo and dafydd-t and removed the request for a team on November 4, 2024 14:15
@njvrzm force-pushed the njvrzm/use-socks-config-in-datasource-cache-key branch from 2d40f3b to ec07f95 on November 4, 2024 14:20
@wbrowne (Member) left a comment:

A few thoughts

backend/datasource/instance_provider.go - review thread (outdated, resolved)
backend/proxy/secure_socks_proxy.go - review thread (outdated, resolved)
```go
// Excerpt from the hashing code under review:
if c == nil {
	return ""
}
h := fnv.New64a()
```
@wbrowne (Member) commented:

Have you thought about how adding this affects performance? I had originally thought that maybe specifying an ID for the proxy config would be better (HG writes the same config for instances of the same stack), and then we could just use that as part of the instance key instead?

@njvrzm (Author) commented on Nov 5, 2024:

> Have you thought about how adding this affects performance?

I have, yeah. FNV is specifically designed for speed, and this will only be triggered on network requests, whose latency should dominate. Just to be sure, though, I've now benchmarked it as well. It's a bit hard to get accurate results because the setup takes longer than the function itself, but it takes 1-2 microseconds per call to ProxyHashFromContext on my laptop, and less than half of that is due to FNV itself. That seems acceptable to me for a network-driven code path, but please let me know if you disagree.
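A self-contained benchmark along these lines might look like the sketch below; it measures only the FNV-1a hashing over a PEM-sized input and is not the PR's actual benchmark:

```go
package proxy_test

import (
	"bytes"
	"hash/fnv"
	"testing"
)

// BenchmarkFNV64a hashes an input roughly the size of a PEM client
// certificate plus key, to approximate the per-request hashing cost.
func BenchmarkFNV64a(b *testing.B) {
	input := bytes.Repeat([]byte("proxy-config-bytes-"), 150) // ~2.9 KB
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		h := fnv.New64a()
		_, _ = h.Write(input)
		_ = h.Sum64()
	}
}
```

Run with `go test -bench=FNV64a` in the package containing the benchmark.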

> I had originally thought that maybe specifying an ID for the proxy config would be better

That does seem cleaner, but I don't think it's something I can tackle at the moment, and my understanding from @dafydd-t is that it's unlikely to get prioritized on that side very soon. This issue has already caused significant problems for at least one customer, and I think it may have contributed to another recent escalation, so I'd like to fix it sooner rather than later. If adding a proxy config ID is something you can take on relatively soon, I'm happy to change this to use it instead.

@wbrowne (Member) commented:

I'm happy to drive that work if we establish that it's a better path to go down - but I'll wait for @dafydd-t to weigh in.

Also - is it fine to just hash the cert key or is the value also required?

@njvrzm (Author) commented:

Good question! I think the key may be sufficient, but I will also wait for further input 😅

@dafydd-t (Contributor) commented:

> Also - is it fine to just hash the cert key or is the value also required?

I think you could take just the client key or the client cert - they are generated together. In fact, the cert contains a signature from the key, so just hashing the cert should be sufficient.

I think I would add the Root CAs to the hash, though, to handle the case where we have to rotate CA keys; when that happens, we want the datasource API to refresh too.

@wbrowne (Member) commented:

The idea mentioned before was that Grafana would compute the hash only once and it would then be sent with every request - otherwise, with the other solutions, it has to be recalculated even though it never actually changes. I just thought that was ideal, but I can see the concern about spreading code between grafana/grafana and the SDK. Some effort could potentially be spent so that the code lives in the SDK (for example, a function that creates the config map with the PDC config and automatically populates the hashing field).

> but the GrafanaCfg where I'm storing the hash now persists for the lifetime of the datasource instance.

This isn't true, unfortunately (it's tied to the request, not the instance), and therefore the hash is always recalculated.
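A sketch of the kind of helper described above, on the Grafana side; the function name and map keys are made up for illustration and are not the real GrafanaCfg keys:

```go
package pdcconfig

import (
	"fmt"
	"hash/fnv"
)

// BuildProxyConfigMap assembles the PDC-related settings sent to the plugin
// and precomputes a hash field once, so the SDK could use it directly in the
// instance cache key instead of rehashing the certificates on every request.
func BuildProxyConfigMap(clientCert, clientKey, rootCA, proxyAddr string) map[string]string {
	h := fnv.New64a()
	for _, s := range []string{clientCert, clientKey, rootCA, proxyAddr} {
		_, _ = h.Write([]byte(s))
	}
	return map[string]string{
		"proxy_client_cert": clientCert, // illustrative key names, not the real GrafanaCfg keys
		"proxy_client_key":  clientKey,
		"proxy_root_ca":     rootCA,
		"proxy_address":     proxyAddr,
		"proxy_config_hash": fmt.Sprintf("%x", h.Sum64()),
	}
}
```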

@njvrzm (Author) commented on Nov 15, 2024:

> This isn't true, unfortunately (it's tied to the request, not the instance), and therefore the hash is always recalculated.

🤦🏻 I think I misunderstood something you said earlier, and didn't follow the code trail back far enough. My apologies. Sigh.

But I'd still really prefer to keep this entirely in the SDK, if possible. Two options I see for that:

  1. Accept the 1-2 microsecond latency hit on requests.
  2. Cache the cert -> hash mapping on the instance provider instead of in the GrafanaCfg.

I think I've come around on option 1 - at a high but plausible rate of requests it could become a non-trivial amount of CPU. Now that I think about it, though, if we're going to do 2, we might as well skip the hash: just use the client key (since it's smaller) as part of the instance cache key. In fact we could just use the last, say, four bytes of it - the beginning is the same for all of them. The key is encoded in base64, so that gets us an acceptably low collision rate.

So, my final proposal:

  1. Add the last 4 bytes of the PDC client key value to the instance cache key.

Going to pause for one last round of feedback 😅 but I hope this is acceptable to everyone.
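A minimal sketch of that proposal, with illustrative names rather than the SDK's actual code (note the caveat about the constant PEM footer raised in the following comments):

```go
package datasource

import "fmt"

// keySuffix returns the last n bytes of the PDC client key value. The suffix
// must be long enough to reach past the constant PEM footer, or every key
// produces the same suffix.
func keySuffix(clientKey string, n int) string {
	if len(clientKey) <= n {
		return clientKey
	}
	return clientKey[len(clientKey)-n:]
}

// cacheKey appends the key suffix to the datasource identity, so a rotated
// PDC client key results in a new cached instance. The proposal above uses 4
// bytes; the follow-up comments recommend a longer suffix in practice.
func cacheKey(dsUID, clientKey string) string {
	return fmt.Sprintf("%s#%s", dsUID, keySuffix(clientKey, 4))
}
```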

@njvrzm (Author) commented:

Well - I went ahead with that approach, since it's getting late on Friday for me. Let me know what you think.

@dafydd-t (Contributor) commented:

> Add the last 4 bytes of the PDC client key value to the instance cache key.

The base64-decoded key is in PEM format (at least it is on file), so the last line is always the same. I'd recommend something a bit longer.
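A toy illustration of that collision, with made-up key bodies: every PEM block ends with the same footer line, so a 4-byte suffix is identical across keys.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Two stand-in "keys" that differ only in their body; both end with the
	// standard PEM footer, as real client keys do.
	a := "-----BEGIN PRIVATE KEY-----\nAAAAAAAA\n-----END PRIVATE KEY-----\n"
	b := "-----BEGIN PRIVATE KEY-----\nBBBBBBBB\n-----END PRIVATE KEY-----\n"

	fmt.Println(a[len(a)-4:] == b[len(b)-4:])                        // true: 4-byte suffixes collide
	fmt.Println(strings.HasSuffix(a, "-----END PRIVATE KEY-----\n")) // true: the footer is constant
}
```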

@njvrzm (Author) commented:

For the record, @dafydd-t, you're totally right, that's also true for the key contents passed in, so I've opened #1157 to fix that.

@njvrzm requested a review from wbrowne November 5, 2024 09:53
@njvrzm force-pushed the njvrzm/use-socks-config-in-datasource-cache-key branch from d5145c4 to 03b9eba on November 5, 2024 09:54
@njvrzm force-pushed the njvrzm/use-socks-config-in-datasource-cache-key branch from 03b9eba to e48ae89 on November 5, 2024 09:56
@dafydd-t previously approved these changes Nov 14, 2024
@dafydd-t self-requested a review November 14, 2024 17:03
@dafydd-t dismissed their stale review November 14, 2024 17:03 ("significant changes")

@njvrzm force-pushed the njvrzm/use-socks-config-in-datasource-cache-key branch from 85fa234 to 5461e9d on November 15, 2024 10:13
backend/config.go - review thread (outdated, resolved)
@njvrzm requested a review from wbrowne November 15, 2024 10:15
@wbrowne (Member) left a comment:

LGTM

backend/datasource/instance_provider_test.go - review thread (outdated, resolved)
@njvrzm merged commit 0aa9d95 into main Nov 20, 2024
3 checks passed
@njvrzm deleted the njvrzm/use-socks-config-in-datasource-cache-key branch November 20, 2024 10:51
@njvrzm mentioned this pull request Nov 25, 2024