Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(rfc): operation cache warmer #1115

Open
wants to merge 19 commits into
base: main
Choose a base branch
from
107 changes: 107 additions & 0 deletions rfc/distributed-operation-cache.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
---
title: "Distributed Operation Cache"
StarpTech marked this conversation as resolved.
Show resolved Hide resolved
author: Dustin Deus
date: 2024-08-25
status: Draft
---

# Distributed Operation Cache

- **Author:** Dustin Deus
- **Date:** 2024-08-25
- **Status:** Draft

## Abstract

This RFC describes a new feature to reduce the latency of the system by pre-planning the most expensive and requested operations before the router accepts traffic. We achieve this by computing the Top-N GraphQL operations available and making them available to all routers instances before they accept traffic.

## Motivation

GraphQL is a powerful tool to query data from a server. However, the flexibility of the query language comes with a cost. The cost is the complexity of the query and how expensive it is to normalize, plan and execute it. While execution performance is primarily a concern of the underlying subgraphs, the planning phase can be a unpredictable and significant latency contributor. The distributed operation cache aims to reduce this latency by pre-planning the most expensive and requested operations ahead to make it invisible to the user.

# Proposal
StarpTech marked this conversation as resolved.
Show resolved Hide resolved

The distributed operation cache is semi-automatic and allows the user to push specific operations to the cache but also automatically computes the most expensive and requested operations of the last time frame (configurable). The cache has a fixed size of operations e.g. 100 (configurable) and is shared across all router instances. An operation can be a regular query, subscription, mutation or persisted operation. When the cache capacity is reached, manual operations have a higher priority than automatic operations. This allows users to manage the priority of operations in the cache themselves.

### Pushing operations to the cache

The User can push individual operations to the distributed operation cache by using the CLI:

```bash
wgc router cache add -g mygraph operations.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be under the namespace "router"?
At which level do we apply cache warming?
Org? Namespace? Graph? Router?

As an alternative, you could consider creating a "cache-warming-bucket", which you can then use to add operations, and which can be used in the router config, e.g.:

version: "1"

cache_warmup:
  enabled: true
  interval: 5m
  bucket: myBucket

I'm not sure though if this is the right direction.
The cache warming bucket is somewhat closely related to a Graph because the containing operations need to be compatible with the client schema of the graph.

```

The CLI command will add the operations from the file `operations.json` to the distributed operation cache of the graph `mygraph`. The file must contain a list of operations in JSON format. The operations can be queries, subscriptions, mutations or persisted operations.

```json5
[
// Queries
{
"body": "query { ... }"
},
// Persisted operation
{
"sha256Hash": "1234567890",
"body": "query { ... }",
}
]
```

The cli command is idempotent and always updates the cache with the latest operations. This doesn't trigger the computation of the Top-N operations which is done periodically by the Cosmo Platform.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another aspect to think about is pre-compiling operation plans.
We could make plans serializable, generate and serialize them at composition time.
Once the router wants to "load" them, it doesn't have to plan everything but can load the plan much faster.


### Automatic operation computation

At the same time, WunderGraph Cosmo is analyzing the incoming traffic based on the OpenTelemetry metrics that each router is sending. The Cosmo Platform computes the Top-N operations for each graph and combines it with the manually added operations. The Top-N operations are then pushed to the distributed operation cache of the graph.

### Top-N computation

The Top-N computation is based on the following metrics:

- Total operation pre-execution time: Normalization, Validation, Planning
- Total request count

The Top-N computation is done for a specific time interval e.g. 3-72 hour (configurable). The operations are sorted by the pre-execution time and request count. The Top-N operations are then pushed to the distributed operation cache. Manual operations have a higher priority than automatic operations. This means when the cache capacity is reached, manual operations are moved to the cache first and automatic operations are removed.

#### Example

The following example shows the Top-5 operations of a graph. The cache capacity is 5. The operations are sorted by the total pre-execution time and request count in descending order.

```
Operation A: 400ms, 1000 requests (Manual added)
Operation B: 300ms, 500 requests (Automatic)
Operation C: 200ms, 200 requests (Automatic)
Operation D: 100ms, 100 requests (Manual added)
Operation E: 50ms, 50 requests (Automatic)
```

The user can add three more manual operations to the cache until the cache capacity is reached. This has the effect that no automatic operations can be added to the cache. In that case, we assume that the user knows better which operations are important. If the capacity is reached and the user adds another manual operation, the least expensive manual operation is removed from the cache.

### Cache update process

The router checks periodically e.g. every 5min for updates of the distributed operation cache. The cache is checked explicitly when the router starts and when the schema changes. The cache is loaded and all operations are pre-planned before the router accepts traffic. The cache is updated in the background and doesn't block the router from accepting traffic.

### Platform integration

For containerized environments like Kubernetes, users should use the readiness probe to ensure that the router is ready to accept traffic. Setting not to small values for the readiness probe timeout is recommended to ensure that the router has enough time to prepare the cache. For schema updates after startup, this process is non-blocking because the new graph schema isn't swapped until the cache is warmed up.

### Cosmo UI integration

A User can disable the distributed operation cache in the Cosmo UI. The User can see the current operations in the cache and remove them if necessary. The User can also see the current status of the cache and the last computation time.

#### Triggering the computation manually

A User is able to trigger the computation of the Top-N operations manually in the Cosmo UI. This is useful for debugging purposes.

## Router configuration

The distributed operation cache can be enabled or disabled in the router configuration file. The default is enabled. A valid Graph API key is required to fetch the operations cache from the Cosmo Platform.

```yaml
version: "1"

cache_warmup:
enabled: true
interval: 5m
```

_For this RFC, we only consider support for the WunderGraph Cosmo CDN._
Loading