chore(rfc): operation cache warmer #1115

Open · wants to merge 19 commits into base: main

rfc/operation-cache-warmer.md: 107 additions, 0 deletions
---
title: "Operation Cache Warmer"
author: Dustin Deus
date: 2024-08-25
status: Draft
---

# Operation Cache Warmer

- **Author:** Dustin Deus
- **Date:** 2024-08-25
- **Status:** Draft

## Abstract

This RFC describes a new feature to reduce the latency of the system by pre-planning the most expensive and most requested operations before the router accepts traffic. We achieve this by computing the Top-N GraphQL operations and distributing them to all router instances before they accept traffic.

## Motivation

GraphQL is a powerful tool to query data from a server. However, the flexibility of the query language comes at a cost: the complexity of a query determines how expensive it is to normalize, plan and execute. While execution performance is primarily a concern of the underlying subgraphs, the planning phase can be an unpredictable and significant latency contributor. The operation cache warmer aims to reduce this latency by pre-planning the most expensive and most requested operations ahead of time, making planning latency invisible to the user.

## Proposal

The distributed operation cache is semi-automatic: the user can push specific operations to the cache, and the Cosmo Platform additionally computes the most expensive and most requested operations over a configurable time frame. The cache has a fixed capacity, e.g. 100 operations (configurable), and is shared across all router instances. An operation can be a regular query, mutation, subscription or persisted operation. When the cache capacity is reached, manually pushed operations have a higher priority than automatic operations, which lets users manage the priority of cached operations themselves. An operation may become incompatible with a future schema change; in that case, the operation is removed from the cache.

### Pushing operations to the cache

The user can push individual operations to the operation cache by using the CLI:

```bash
wgc federated-graph operation-cache add --graph mygraph --file operations.json

```

**Reviewer:** Curious about manual cache invalidation. What will happen in the following scenario: first we run `wgc federated-graph operation-cache add --graph mygraph --file operationA.json`, and then we run `wgc federated-graph operation-cache add --graph mygraph --file operationB.json`. What will we have in the cache after the second command? Will it be just operation B, or both operation A and operation B? Curious, because if it is both, the cache will eventually be filled with manual operations and there will be no space left for automatic operations unless we can invalidate them explicitly.

**Author:** This is right. Both operations will be in the "batch". This is the part where a customer can manage the cache themselves and is responsible for it. At some point, we could also introduce a percentage limit on how much space can be reserved for manually pushed operations. In the first version, we want to keep it simple.

**Reviewer:** I wonder if this command could be executed automatically (e.g. as part of CI/CD) or if we have to call it manually?

**Author:** Yes, the idea is to run this locally as well as part of the CI/CD process.

**Member:** Good point. I'd recommend making the operation idempotent so that we can run this over and over again on each deployment without creating duplicates.

**Author (@StarpTech, Sep 10, 2024):** That was the idea. Operations are identified by their hash and appended to the "batch" until the cache is full. A user can repush the changes in another pipeline run to re-prioritize the list.

The CLI command will add the operations from the file `operations.json` to the operation cache of the graph `mygraph`. The file must contain a list of operations in JSON format. The operations can be queries, subscriptions, mutations or persisted operations.

```json5
[
  // Queries
  {
    "body": "query { ... }"
  },
  // Persisted operation
  {
    "sha256Hash": "1234567890",
    "body": "query { ... }"
  }
]
```

The CLI command is idempotent and always updates the cache with the latest operations. It doesn't trigger the computation of the Top-N operations, which is done periodically by the Cosmo Platform.
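As discussed in the review thread above, idempotency means operations are identified by their hash: pushing the same file twice should not create duplicates. A minimal Go sketch of that idea; the type and function names are illustrative and not taken from the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ManualOperation mirrors one entry of operations.json (illustrative type).
type ManualOperation struct {
	SHA256Hash string
	Body       string
}

// upsertOperations adds operations to the manual batch, keyed by their hash,
// so pushing the same file again overwrites entries instead of duplicating them.
func upsertOperations(batch map[string]ManualOperation, ops []ManualOperation) {
	for _, op := range ops {
		key := op.SHA256Hash
		if key == "" {
			// Regular (non-persisted) operations are keyed by a hash of their body.
			sum := sha256.Sum256([]byte(op.Body))
			key = hex.EncodeToString(sum[:])
		}
		batch[key] = op
	}
}

func main() {
	batch := map[string]ManualOperation{}
	ops := []ManualOperation{{Body: "query { me { id } }"}}
	upsertOperations(batch, ops)
	upsertOperations(batch, ops) // second push is a no-op
	fmt.Println(len(batch))      // prints 1
}
```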

### Automatic operation computation

At the same time, WunderGraph Cosmo analyzes the incoming traffic based on the OpenTelemetry metrics that each router sends. The Cosmo Platform computes the Top-N operations for each graph and combines them with the manually added operations. The Top-N operations are then pushed to the operation cache of the graph.

### Top-N computation

The Top-N computation is based on the following metrics:

- Total operation pre-execution time: Normalization, Validation, Planning
- Total request count

The Top-N computation is done for a specific time interval, e.g. 3-72 hours (configurable). The operations are sorted by pre-execution time, with request count as a tie-breaker. The Top-N operations are then pushed to the operation cache. Manual operations have a higher priority than automatic operations: when the cache capacity is reached, manual operations are placed in the cache first and automatic operations are evicted.

**Reviewer:** How will Top-N prioritize between request count and pre-execution time? I could see wanting to prioritize the slowest-planned queries and then sort by request count, or vice versa.

**Author (@StarpTech, Sep 9, 2024):** The ultimate goal is to prepare operations in advance, focusing on those that take the most time. We will only sort by request count when two items have the same pre-execution time.

**Member:** I think @joshlevinson is right in that we should prioritize by planning time and then request count.

**Author:** We mean the same thing 😄


#### Example

The following example shows the Top-5 operations of a graph with a cache capacity of 5, sorted by total pre-execution time and request count in descending order. Two operations were added manually, leaving three slots that the Cosmo Platform can fill with automatic operations based on the Top-N computation.

```
Operation A: 400ms, 1000 requests (Manually added)
Operation B: 300ms, 500 requests (Automatic slot)
Operation C: 200ms, 200 requests (Automatic slot)
Operation D: 100ms, 100 requests (Manually added)
Operation E: 50ms, 50 requests (Automatic slot)
```

Alternatively, the user can add three more manual operations until the cache capacity is reached, in which case no automatic operations can be added to the cache. Here we assume that the user knows best which operations are important.
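To make the selection logic concrete, here is a minimal Go sketch of the prioritization described above: manual operations claim slots first, and automatic slots are filled by pre-execution time with request count as tie-breaker. The type and function names are illustrative, not the router's or platform's actual API:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// cacheEntry is an illustrative representation of one cached operation.
type cacheEntry struct {
	Name             string
	PreExecutionTime time.Duration // normalization + validation + planning
	RequestCount     int
	Manual           bool
}

// selectTopN decides which entries keep a cache slot: manually pushed
// operations first, then automatic operations ordered by pre-execution time,
// with request count breaking ties.
func selectTopN(entries []cacheEntry, capacity int) []cacheEntry {
	sort.SliceStable(entries, func(i, j int) bool {
		if entries[i].Manual != entries[j].Manual {
			return entries[i].Manual // manual entries win a slot first
		}
		if entries[i].PreExecutionTime != entries[j].PreExecutionTime {
			return entries[i].PreExecutionTime > entries[j].PreExecutionTime
		}
		return entries[i].RequestCount > entries[j].RequestCount
	})
	if len(entries) > capacity {
		entries = entries[:capacity]
	}
	return entries
}

func main() {
	// Data from the Top-5 example above; A and D are manual and therefore
	// guaranteed a slot, B, C and E fill the remaining automatic slots.
	top := selectTopN([]cacheEntry{
		{"A", 400 * time.Millisecond, 1000, true},
		{"B", 300 * time.Millisecond, 500, false},
		{"C", 200 * time.Millisecond, 200, false},
		{"D", 100 * time.Millisecond, 100, true},
		{"E", 50 * time.Millisecond, 50, false},
	}, 5)
	for _, e := range top {
		fmt.Println(e.Name)
	}
}
```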

### Cache update process

The router periodically checks for updates to the operation cache, e.g. every 5 minutes. The cache is also checked explicitly when the router starts and when the schema changes. On startup, the cache is loaded and all operations are pre-planned before the router accepts traffic. Subsequent updates happen in the background and don't block the router from accepting traffic.
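A rough Go sketch of this update loop, assuming hypothetical `fetchOperationBatch` and `planOperation` helpers; the real router internals will differ:

```go
package main

import (
	"context"
	"log"
	"time"
)

// Placeholder helpers standing in for the real router internals.
func fetchOperationBatch(ctx context.Context) ([]string, error) { return nil, nil }
func planOperation(ctx context.Context, operation string) error { return nil }

// warmCache fetches the latest operation batch and pre-plans every operation.
// On startup this runs before the router accepts traffic.
func warmCache(ctx context.Context) error {
	ops, err := fetchOperationBatch(ctx) // e.g. download the batch artifact from the CDN
	if err != nil {
		return err
	}
	for _, op := range ops {
		if err := planOperation(ctx, op); err != nil {
			// Operations that are incompatible with the current schema are skipped.
			log.Printf("skipping operation: %v", err)
		}
	}
	return nil
}

// runCacheWarmer re-checks the operation cache on a fixed interval (e.g. 5m).
// These background updates do not block the router from serving traffic.
func runCacheWarmer(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := warmCache(ctx); err != nil {
				log.Printf("cache warm-up failed: %v", err)
			}
		}
	}
}

func main() {
	ctx := context.Background()
	if err := warmCache(ctx); err != nil { // initial, blocking warm-up
		log.Fatalf("initial warm-up failed: %v", err)
	}
	go runCacheWarmer(ctx, 5*time.Minute)
	select {} // stand-in for the router's main serve loop
}
```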

### Platform integration

For containerized environments like Kubernetes, users should rely on the readiness probe to ensure that the router is ready to accept traffic. A sufficiently large readiness probe timeout is recommended so that the router has enough time to prepare the cache. For schema updates after startup, this process is non-blocking because the new graph schema isn't swapped in until the cache is warmed up.

**Reviewer:** To double-check: `/health/ready` will be successful only when the router is warmed up, right?

**Author:** That's right!
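A sketch of how the readiness endpoint could be gated on the initial warm-up. The `/health/ready` path comes from the discussion above; the port, flag, and handler shape are assumptions, not the router's actual implementation:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var cacheWarmed atomic.Bool

// readyHandler reports ready only after the initial cache warm-up has finished,
// so a Kubernetes readiness probe holds back traffic until planning is done.
func readyHandler(w http.ResponseWriter, r *http.Request) {
	if !cacheWarmed.Load() {
		http.Error(w, "warming operation cache", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	go func() {
		// Stand-in for the real warm-up: fetch the batch and pre-plan operations.
		time.Sleep(2 * time.Second)
		cacheWarmed.Store(true)
	}()
	http.HandleFunc("/health/ready", readyHandler)
	log.Fatal(http.ListenAndServe(":3002", nil)) // port is illustrative
}
```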


### Cosmo UI integration

A user can disable the operation cache in the Cosmo UI. The user can see the current operations in the cache and remove them if necessary, and can also see the current status of the cache and the last computation time.

**Reviewer:** Dumb question: what will happen if we disable the operation cache in the Cosmo UI but keep `cache_warmup` enabled in the router configuration?

**Author (@StarpTech, Sep 9, 2024):** In that case, Cosmo would no longer compute the latest Top-N operations for you, and your router would at some point fetch an outdated list of operations. It won't break anything.

**Member:** I'd say that in this case the router would try to fetch operations for warmup, but the CP won't return anything, so nothing will be warmed up.

**Author (@StarpTech, Sep 10, 2024):** The latest batch of operations has been pushed to S3. The router fetches it on startup and on schema changes. If nothing changes, the router will always fetch the latest available batch; there is no interaction with the control plane. This is one possibility, the "best-effort" approach, because there might be operations that can still benefit from the stale cache. Another solution is to delete the cache after the user has disabled it in the Studio; in that case the router won't find any artifact and will skip the warm-up. I think this is the case that @jensneuse described. This could be made configurable.

**Member:** I think it's reasonable to delete the warming data in S3 when the feature is disabled.

**Author (@StarpTech, Sep 10, 2024):** I'm fine with both ways. 👍 Let's purge it.


#### Triggering the computation manually

A user can trigger the computation of the Top-N operations manually in the Cosmo UI. This is useful for debugging purposes.

## Router configuration

The operation cache can be enabled or disabled in the router configuration file; it is enabled by default. A valid Graph API key is required to fetch the operation cache from the Cosmo Platform.

```yaml
version: "1"

cache_warmup:
  enabled: true
  interval: 5m
```

_For this RFC, we only consider support for the WunderGraph Cosmo CDN._