Calculation of Cache Hit Ratio #1551

ilterpehlivan · 2023-09-28T14:01:16Z

ilterpehlivan
Sep 28, 2023

Hi,

We are evaluating spicedb for one of our client and they asked to see performance results. We ran some load testing with 2 level schema and 1 million relationships but pur P95 is way higher (700ms) than what is posted in this article (https://authzed.com/blog/google-scale-authorization#checkpermission-latency-3)
Btw we followed the similar infra suggested in that link but we use Postgre instead of CockRoachDB
So question comes down the use of caches! I dont believe our cache hit ratio is same as the article (%95).

Therefore could you please help to calculate the cache hit ratio by using collected promethus metrics ?

Thanks

Answered by ecordell

Sep 29, 2023

This is a big topic; I'll try and give a few pointers below. We can also work through specific issues if you can provide more details.

We ran some load testing with 2 level schema and 1 million relationships but pur P95 is way higher (700ms) than what is posted in this article (https://authzed.com/blog/google-scale-authorization#checkpermission-latency-3)

To preface this discussion, it can be difficult to generate a representative load for SpiceDB: the schema, the shape and quantity of the data stored, and the specific access patterns can have a big effect on performance.

If you hit the same relationship over and over again, you'll get perfect caching and see amazing, unrealistic result…

View full answer

ecordell · 2023-09-29T18:06:25Z

ecordell
Sep 29, 2023
Maintainer

This is a big topic; I'll try and give a few pointers below. We can also work through specific issues if you can provide more details.

We ran some load testing with 2 level schema and 1 million relationships but pur P95 is way higher (700ms) than what is posted in this article (https://authzed.com/blog/google-scale-authorization#checkpermission-latency-3)

To preface this discussion, it can be difficult to generate a representative load for SpiceDB: the schema, the shape and quantity of the data stored, and the specific access patterns can have a big effect on performance.

If you hit the same relationship over and over again, you'll get perfect caching and see amazing, unrealistic results. On the other end of the spectrum, if every query is totally random, you'll get worst-case performance because the cache is unlikely to have the results you need for any particular request, and you'll be bottlenecked by the database.

Real traffic is usually somewhere in the middle: specific request paths in applications often make use of smaller subsets of semantically related relationships, so a single workflow in an application will have some mix of cache and non-cached results.

We have a tool for generating more realistic load (mentioned in the article you linked) but it is not currently open source.

Btw we followed the similar infra suggested in that link but we use Postgre instead of CockRoachDB

Cockroach and Postgres use different strategies for storing data: in Cockroach we make use of AS OF SYSTEM TIME queries to make snapshot queries; in Postgres we record xid information on the relationship itself and use PG's snapshot visibility to decide which relationships exist in a snapshot. The difference in approach means PG also runs a separate GC process to clean up old relationships. Just pointing out that the backing datastores are not perfectly 1-1, and we're constantly finding improvements for one datastore that may not apply to the other.

Also just to note: especially at lower caching ratios, the capacity of the database can be an important factor for scaling. You don't mention what resources you're provisioning for postgres, but it's possible you're just sending too much db traffic for the instance you're using.

Therefore could you please help to calculate the cache hit ratio by using collected promethus metrics ?

With all of that out of the way, there are two sets of metrics for cache ratio that you might be interested in.

Cached Requests

The first are pure request counts, they look like spicedb_dispatch_*_from_cache_total and spicedb_dispatch_client_*_from_cache_total.

dispatch_client metrics are: how many dispatch requests that a node would have sent were avoided because a result was found in the node's local cache.
dispatch: how many dispatch requests were received that didn't have to do any graph computation because a result was found in the node's local cache.

One measures from the client side (didn't have to send a request) and one from the server side (received a request, but didn't have to do any work to answer it).

Here's an example of computing the server-side cache hit ratio for checks with request counts:

sum(rate(spicedb_dispatch_check_from_cache_total{container="spicedb"}[$__interval])) / sum(rate(spicedb_dispatch_check_total{container="spicedb"}[$__interval]))

Cached Operations

Request-count metrics aren't the best way to think about caching ratios though, because requests can fan out (i.e. a cached result can save N graph operations, not just 1).

The other set of metrics we use for computing cache rates are work-based, and report the amount of work avoided rather then simple hit/miss. They're available under spicedb_services_dispatches_sum metric names.

For example, to compute percent of cached operations for checks, something like:

100 * sum(rate(spicedb_services_dispatches_sum{cached="true", method="CheckPermission"}[1m])) / (sum(rate(spicedb_services_dispatches_sum{method="CheckPermission"}[1m])))

Each cache entry stores how many downstream requests it took to compute it initially. So if you have very deep or wide data, a single cache entry could save you 5 or 10 or 100 downstream requests. These metrics report a more representative caching ratio, as "work avoided" rather than "cache hits".

A little long-winded, but I'm hoping this helps orient the problem for you. Let us know if you have more specific tests you want to run.

1 reply

ilterpehlivan Oct 2, 2023
Author

Thanks @ecordell for detailed answer and I appreciate your advises

We followed your advice and created properly distributed load based on "Pareto" distribution as you did in your published tests. Then this helped to increase cache hit ratio around %75 which is awesome :) However our latency results are still not improved!
For 100 Rps, results look good in 2 digits but once we increase load to 300Rps or 500rps it goes messy. Our infrastructure is solid with 4 nodes eks cluster with strong compute resources also we monitor cluster CPU/memory status which does not pass over %60 usage!

Here my question comes; do you think this behaviour possible, by even increasing cache hit still lower results ? Or our latency calculation parts (based on otel traces and aws Xray) might be bottleneck which could give us wrong results ?

P.S: it would be really helpful if you can also share your monitoring setup during your performance tests

Thanks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

authzed

Calculation of Cache Hit Ratio #1551

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

authzed

Calculation of Cache Hit Ratio #1551

ilterpehlivan Sep 28, 2023

Replies: 1 comment · 1 reply

ecordell Sep 29, 2023 Maintainer

Cached Requests

Cached Operations

ilterpehlivan Oct 2, 2023 Author

ilterpehlivan
Sep 28, 2023

Replies: 1 comment 1 reply

ecordell
Sep 29, 2023
Maintainer

ilterpehlivan Oct 2, 2023
Author