-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use hash of proxy settings as part of datasource instance cache key #1133
Conversation
2d40f3b
to
ec07f95
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts
backend/proxy/secure_socks_proxy.go
Outdated
if c == nil { | ||
return "" | ||
} | ||
h := fnv.New64a() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought about how adding this affects performance? I had originally thought that maybe specifying an ID for the proxy config would be better (HG writes the same for instances of the same stack), and then we just use that as part of the instance key instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought about how adding this affects performance?
I have, yeah. Fnv is specifically designed for speed, and this will only be triggered on network requests, the latency of which should dominate. Just to be sure, though, now I've benchmarked it as well. It's a bit hard to get accurate results because the setup takes longer than the function, but it takes 1-2 microseconds per call to ProxyHashFromContext
on my laptop, and less than half of that is due to fnv itself. That seems acceptable to me, for a network-driven code path, but please let me know if you disagree.
I had originally thought that maybe specifying an ID for the proxy config would be better
That does seem cleaner, but I don't think it's something I can tackle at the moment, and my understanding from @dafydd-t is that it's unlikely to get prioritized on that side very soon. This issue has already caused significant problems for at least one customer, and I think may have contributed to another recent escalation, so I'd like to fix it sooner than later. If adding a proxy config ID is something you can take on relatively soon I'm happy to change this to use it instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to drive that work if we establish that it's a better path to go down - but I'll wait for @dafydd-t to weigh in.
Also - is it fine to just hash the cert key or is the value also required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I think the key may be sufficient, but I will also wait for further input 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - is it fine to just hash the cert key or is the value also required?
I think you could take just the client key or the client cert - they are generated together. In fact, the cert contains a signature from the key, so just hashing the cert should be sufficient.
I think I would add the Root CAs to the hash, though, to handle the case where we have to rotate CA keys, we want the datasource API to refresh too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea mentioned before was that Grafana would do it only once and then would be sent for every request - otherwise with other solutions it has to be recalculated (when it never actually changes). I just thought that was ideal, but can see the concern about spreading code between grafana/grafana and the SDK. There could be some effort spent to make the code could live in the SDK though potentially (a function to create the map with pdc config and automatically populate the hashing field for example).
but the GrafanaCfg where I'm storing the hash now persists for the lifetime of the datasource instance.
This isn't true unfortunately (is tied to request and not instance) and therefore the hash is always recalculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't true unfortunately (is tied to request and not instance) and therefore the hash is always recalculated.
🤦🏻 I think I misunderstood something you said earlier, and didn't follow the code trail back far enough. My apologies. Sigh.
But I'd still really prefer to keep this entirely in the sdk, if possible. Two options I see for that:
- Accept the 1-2 microsecond latency hit on requests.
- Cache the cert -> hash mapping on the instance provider instead of in the
GrafanaCfg
.
I think I've come around on 1. - at a high but plausible rate of requests it could become a non-trivial amount of CPU. Now that I think about it, though, if we're going to do 2, we might as well skip the hash: just use the client key (since it's smaller) as part of the instance cache key. In fact we could just use the last, say, four bytes of it - the beginning is the same for all of them. The key is encoded in base64, so that gets us an acceptably low collision rate.
So, my final proposal:
- Add the last 4 bytes of the PDC client key value to the instance cache key.
Going to pause for one last round of feedback 😅 but I hope this is acceptable to everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well - I went ahead with that approach, since it's getting late on Friday for me. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the last 4 bytes of the PDC client key value to the instance cache key.
The base64-decoded key is in PEM format (at least it is on file), so the last line is always the same. I'd recommend something a bit longer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d5145c4
to
03b9eba
Compare
03b9eba
to
e48ae89
Compare
85fa234
to
5461e9d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What this PR does / why we need it:
This makes the instance manager's cache key depend on the proxy configuration. Currently in Grafana Cloud, different Grafana instances have different socks proxy configuration, which makes
NeedsUpdate
return true every time a request comes from a different instance, which is usually every request. This has caused at least one major issue with a very large number of datasource instances being created unnecessarily.With this change, the proxy configuration will be used to create the instance cache key. When comparing the cached instance to the incoming configuration, the proxy configuration will be the same unless it has actually been modified. As long as the datasource configuration doesn't change, there should only be one cached instance per Grafana instance per API server instance, rather than an ever-growing number.
Which issue(s) this PR fixes:
https://github.com/grafana/app-platform-wg/issues/174
Fixes #
Special notes for your reviewer: