Calculation of Cache Hit Ratio #1551
-
Hi, we are evaluating SpiceDB for one of our clients, and they asked to see performance results. We ran some load testing with a 2-level schema and 1 million relationships, but our P95 is much higher (700ms) than what is posted in this article (https://authzed.com/blog/google-scale-authorization#checkpermission-latency-3). Could you please help us calculate the cache hit ratio using the collected Prometheus metrics? Thanks
-
This is a big topic; I'll try and give a few pointers below. We can also work through specific issues if you can provide more details.
To preface this discussion, it can be difficult to generate a representative load for SpiceDB: the schema, the shape and quantity of the data stored, and the specific access patterns can have a big effect on performance.

If you hit the same relationship over and over again, you'll get perfect caching and see amazing, unrealistic results. On the other end of the spectrum, if every query is totally random, you'll get worst-case performance because the cache is unlikely to have the results you need for any particular request, and you'll be bottlenecked by the database.

Real traffic is usually somewhere in the middle: specific request paths in applications often make use of smaller subsets of semantically related relationships, so a single workflow in an application will have some mix of cached and non-cached results. We have a tool for generating more realistic load (mentioned in the article you linked) but it is not currently open source.
Cockroach and Postgres use different strategies for storing data: in Cockroach we make use of […]

Also, just to note: especially at lower caching ratios, the capacity of the database can be an important factor for scaling. You don't mention what resources you're provisioning for Postgres, but it's possible you're just sending too much db traffic for the instance you're using.
With all of that out of the way, there are two sets of metrics for cache ratio that you might be interested in.

**Cached Requests**

The first are pure request counts.
One measures from the client side (didn't have to send a request) and one from the server side (received a request, but didn't have to do any work to answer it). Here's an example of computing the server-side cache hit ratio for checks with request counts:
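The original example isn't reproduced here, but the arithmetic behind a request-count hit ratio can be sketched as below. The counter values are illustrative placeholders, not real SpiceDB metric names or numbers; read the actual cached/total check counters off your own `/metrics` endpoint (this mirrors what a PromQL `rate()` division would compute):

```python
# Sketch: server-side cache hit ratio from two scrapes of a pair of
# monotonic Prometheus counters (cached checks vs. all checks).
# All names and numbers here are hypothetical.

def counter_rate(curr: float, prev: float, window_s: float) -> float:
    """Per-second rate of a counter over a scrape window."""
    return (curr - prev) / window_s

def hit_ratio(cached_rate: float, total_rate: float) -> float:
    """Fraction of check requests answered from cache."""
    return cached_rate / total_rate if total_rate else 0.0

# Two scrapes, 60 seconds apart (made-up values):
cached = counter_rate(curr=9_500, prev=8_000, window_s=60)    # 25 cached/s
total = counter_rate(curr=12_000, prev=10_000, window_s=60)   # ~33.3 total/s
print(f"hit ratio: {hit_ratio(cached, total):.0%}")  # -> hit ratio: 75%
```

The same division works directly in PromQL over the corresponding `rate()` expressions; the Python above just makes the arithmetic explicit.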
**Cached Operations**

Request-count metrics aren't the best way to think about caching ratios, though, because requests can fan out (i.e. a cached result can save N graph operations, not just 1). The other set of metrics we use for computing cache rates are work-based, and report the amount of work avoided rather than simple hit/miss. They're available under […]. For example, to compute the percent of cached operations for checks, something like:
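The original query isn't reproduced here, but the shape of the work-based ratio is simple: cached operations divided by all operations (cached plus actually executed). A minimal sketch with made-up numbers:

```python
# Sketch of the work-based caching ratio. The inputs would come from
# SpiceDB's work-based counters; the names and values here are
# hypothetical placeholders.

def cached_work_ratio(cached_ops: float, executed_ops: float) -> float:
    """Share of graph operations that were avoided via the cache."""
    total = cached_ops + executed_ops
    return cached_ops / total if total else 0.0

# e.g. 8,000 operations' worth of work served from cache, 2,000 executed:
print(f"cached operations: {cached_work_ratio(8_000, 2_000):.0%}")  # -> 80%
```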
Each cache entry stores how many downstream requests it took to compute it initially. So if you have very deep or wide data, a single cache entry could save you 5 or 10 or 100 downstream requests. These metrics report a more representative caching ratio, as "work avoided" rather than "cache hits".

A little long-winded, but I'm hoping this helps orient the problem for you. Let us know if you have more specific tests you want to run.
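A toy illustration of why the two ratios diverge, using made-up numbers: if each cache entry remembers how many downstream dispatches it originally cost, one hit on a deep subtree saves many operations, so the work-avoided ratio can be much higher than the plain hit/miss ratio:

```python
# Toy model: 5 cache hits on deep subtrees (20 dispatches saved each)
# and 5 misses that were shallow (1 dispatch each). All numbers invented.

from dataclasses import dataclass

@dataclass
class CacheEntry:
    downstream_cost: int  # dispatches needed to compute this entry initially

hits = [CacheEntry(downstream_cost=20)] * 5  # deep subtrees served from cache
misses = 5
dispatches_per_miss = 1                      # shallow lookups that had to run

hit_ratio = len(hits) / (len(hits) + misses)             # plain hit/miss: 0.5
work_avoided = sum(e.downstream_cost for e in hits)      # 100 dispatches saved
work_done = misses * dispatches_per_miss                 # 5 dispatches executed
work_ratio = work_avoided / (work_avoided + work_done)   # ~0.95

print(f"hit ratio: {hit_ratio:.0%}, work avoided: {work_ratio:.0%}")
# -> hit ratio: 50%, work avoided: 95%
```

Same traffic, very different numbers, which is why the work-based metrics are the more representative ones to use.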