
CCIP-2003 Improved fetching Commit Reports from database #726

Merged
mateusz-sekara merged 9 commits into ccip-develop from better-fetching-commitroots on Apr 23, 2024

Conversation

Contributor

@mateusz-sekara mateusz-sekara commented Apr 17, 2024

Motivation

The current implementation fetches all the CommitReports within the permissionlessExecThreshold from the database and filters the result set based on the snoozedRoots. This means that when everything is executed, we keep fetching all roots into memory only to discover that we don't need that data. During high-traffic spikes this is an unnecessary waste of resources in the CL node and the database, as the dataset size grows over time until it reaches the permissionlessExec window (8 hours of logs in prod).

Example load run showing how the fetched dataset changes over time under steady message traffic:

<img width="1488" alt="image" src="https://github.com/smartcontractkit/ccip/assets/18304098/7b216dd4-1ab3-4943-9797-99a38f4f9eaa">

Solution

There are at least two solutions to make this more performant:

  1. Pass executed roots to the database query to fetch only unexpired roots. Unexpired roots can be determined from the snoozedRoots cache. This is probably the most accurate approach because it always fetches the minimal required set from the database.

You can think of the query like this:

```sql
select * from evm.logs
where address = :addr
and event_sig = :event_sig
and substring(data, X, Y) not in (:executed_roots_array)
and block_timestamp > :permissionlessThresholdWindow
```
  2. Make the permissionlessThreshold filter "sliding". Commit Reports are executed sequentially in most cases, so we can keep track of the timestamp of the oldest fully executed report. Instead of inventing a new query for LogPoller, we can achieve a lot based on the assumption that execution is sequential: whenever a report is markedAsExecuted, we bump the block_timestamp filter forward. We already have that information in the cache and indirectly rely on it when filtering out snoozedRoots.

```sql
and block_timestamp > :oldest_executed_timestamp
```

Solution 2 is easier to implement, doesn't require any changes to LogPoller (or new query perf testing), and can be handled entirely in the plugin. We already cache snoozedRoots, so we can reuse that to move the filtering to the database by narrowing the block_timestamp search window. Assuming that reports are executed sequentially, it should be as accurate as solution 1.
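For illustration, here is a minimal Go sketch of how the sliding window in solution 2 could be computed. `queryWindowStart` and its parameter names are assumptions for this sketch, not the actual plugin code:

```go
package ccip

import "time"

// queryWindowStart returns the lower bound used for the block_timestamp filter.
// By default it is now minus permissionlessExecThreshold; once older reports
// are fully executed, the window slides forward to the oldest fully executed
// report's timestamp, so already-executed logs are never fetched again.
func queryWindowStart(
	now time.Time,
	permissionlessExecThreshold time.Duration,
	oldestFullyExecuted time.Time,
	haveExecuted bool,
) time.Time {
	start := now.Add(-permissionlessExecThreshold)
	if haveExecuted && oldestFullyExecuted.After(start) {
		// Everything at or before this timestamp is already executed
		// (under the sequential-execution assumption), so narrow the window.
		start = oldestFullyExecuted
	}
	return start
}
```

The returned timestamp is what would be bound to `:oldest_executed_timestamp` in the query fragment above.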

Load testing solution 2 (the same configuration, but shorter test duration)

<img width="1088" alt="image" src="https://github.com/smartcontractkit/ccip/assets/18304098/c445495d-51e7-4138-a00b-662423fa0cdf">

Contributor

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset.

@mateusz-sekara mateusz-sekara marked this pull request as ready for review April 17, 2024 12:29
@mateusz-sekara mateusz-sekara requested a review from a team as a code owner April 17, 2024 12:29
// Now the search filter will be 0xA timestamp -> [2010-10-11, 2010-10-15]
// If roots are executed out of order, it's not going to change anything. However, in most cases we have sequential root execution, and that is
// a huge improvement because we don't need to fetch all the roots from the database in every round.
unexecutedRootsQueue *orderedmap.OrderedMap[string, time.Time]
Contributor


I would prefer not to use this lib.
We could have our own simple implementation or use a different type.

Contributor


If we have our own simple implementation you can also have the rootsQueueMu defined there.

Contributor Author


It was already in the deps so I've decided to reuse it. Do you have any better ideas? I'm not sure about reimplementing, sounds like reinventing the wheel.

Contributor Author


An alternative implementation would be https://github.com/elliotchance/orderedmap (more GH stars)
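For context, the "own simple implementation" suggested above could look roughly like this: a slice for insertion order plus a map for lookups, with the mutex (the `rootsQueueMu` mentioned in the review) folded into the type. All names here are hypothetical sketches, not the code that was actually merged:

```go
package ccip

import (
	"sync"
	"time"
)

// orderedTimestamps is a minimal insertion-ordered map from commit root
// (hex string) to block timestamp, with the mutex embedded in the type.
type orderedTimestamps struct {
	mu    sync.Mutex
	order []string
	items map[string]time.Time
}

func newOrderedTimestamps() *orderedTimestamps {
	return &orderedTimestamps{items: make(map[string]time.Time)}
}

// Append stores the timestamp for a root, keeping first-insertion order.
func (o *orderedTimestamps) Append(root string, ts time.Time) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if _, ok := o.items[root]; !ok {
		o.order = append(o.order, root)
	}
	o.items[root] = ts
}

// Remove deletes a root, e.g. when its report is fully executed.
func (o *orderedTimestamps) Remove(root string) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if _, ok := o.items[root]; !ok {
		return
	}
	delete(o.items, root)
	for i, r := range o.order {
		if r == root {
			o.order = append(o.order[:i], o.order[i+1:]...)
			break
		}
	}
}

// Oldest returns the timestamp of the oldest root still in the queue; this is
// the value the sliding block_timestamp filter can be bumped to.
func (o *orderedTimestamps) Oldest() (time.Time, bool) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if len(o.order) == 0 {
		return time.Time{}, false
	}
	return o.items[o.order[0]], true
}
```

Whether this beats pulling in a dependency is mostly a maintenance-preference call; functionally either approach provides the ordered root-to-timestamp queue the plugin needs.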

@mateusz-sekara mateusz-sekara requested a review from dimkouv April 23, 2024 06:13
@mateusz-sekara mateusz-sekara merged commit ab356a4 into ccip-develop Apr 23, 2024
85 checks passed
@mateusz-sekara mateusz-sekara deleted the better-fetching-commitroots branch April 23, 2024 08:01
mateusz-sekara added a commit that referenced this pull request Apr 23, 2024
RensR added a commit that referenced this pull request Apr 23, 2024
* Adding evm_chain_id, address and event_sig to data and topic indexes (smartcontractkit/chainlink#12786)
* Improved fetching Commit Reports from database (#726)
@mateusz-sekara mateusz-sekara changed the title Improved fetching Commit Reports from database CCIP-2003 Improved fetching Commit Reports from database May 15, 2024
mateusz-sekara added a commit that referenced this pull request Jul 31, 2024

## Motivation

Detailed context is in #726, but long story short: scanning the entire
`permissionLessExecThreshold` / `messageIntervalVisibility` window from every
Commit plugin instance (think of it as a lane) puts a lot of pressure on the
database. In most cases we don't need the entire window, only the roots that
are not executed yet.

Unfortunately, the solution suggested in the mentioned PR was faulty because
it relied on the `block_timestamp` of unfinalized blocks. Re-orgs on Polygon
exposed that problem; details and the revert are in #1155


## Solution

Divide the cache into two layers:
* finalized commit roots are kept in memory, because we assume these never change
* unfinalized commit roots are fetched from LogPoller, and newly finalized
logs are promoted to the finalized part

There should be no additional memory footprint, because we kept all executed
and finalized roots in cache memory anyway. The performance impact of this
approach should be similar to #726, because we only keep scanning the
unfinalized chunk of the LogPoller, so the number of logs fetched from the
database is very limited.
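A rough Go sketch of the two-layer shape described above; all types and method names here are illustrative assumptions, not the actual cache API in the repo:

```go
package ccip

import (
	"sync"
	"time"
)

// commitRootLog is a simplified stand-in for a commit report root plus the
// block metadata needed to decide finality.
type commitRootLog struct {
	merkleRoot  [32]byte
	blockNumber int64
	timestamp   time.Time
}

// twoLayerRootsCache keeps finalized roots in memory (layer 1) and always
// re-reads the unfinalized tail from LogPoller (layer 2).
type twoLayerRootsCache struct {
	mu        sync.Mutex
	finalized map[[32]byte]commitRootLog
}

func newTwoLayerRootsCache() *twoLayerRootsCache {
	return &twoLayerRootsCache{finalized: make(map[[32]byte]commitRootLog)}
}

// merge promotes freshly finalized logs into the in-memory layer and returns
// the combined view: cached finalized roots plus the still-unfinalized ones.
func (c *twoLayerRootsCache) merge(fromLogPoller []commitRootLog, latestFinalizedBlock int64) []commitRootLog {
	c.mu.Lock()
	defer c.mu.Unlock()

	var unfinalized []commitRootLog
	for _, l := range fromLogPoller {
		if l.blockNumber <= latestFinalizedBlock {
			// Promotion: from now on this root is served from memory only.
			c.finalized[l.merkleRoot] = l
			continue
		}
		unfinalized = append(unfinalized, l)
	}

	out := make([]commitRootLog, 0, len(c.finalized)+len(unfinalized))
	for _, l := range c.finalized {
		out = append(out, l)
	}
	return append(out, unfinalized...)
}
```

Because finalized entries never change, the in-memory layer only grows by promotion, and each round only the unfinalized tail has to be re-read from the database.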

However, there is one caveat here: LogPoller doesn't mark logs as
finalized/unfinalized by default, so we need to ask for the latest finalized
block in the Commit Reader directly and enrich the logs with that information
(a similar approach to what we do for the Offramp reader). Thankfully, asking
the database for the latest block is extremely fast and (if needed) could be
optimized by caching the latest finalized block in the
LogPoller/HeadTracker.
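A hedged sketch of that enrichment step, assuming a generic source for the latest finalized block; the interface and function names below are placeholders, not the real Commit Reader API:

```go
package ccip

import "context"

// latestFinalizedBlockSource stands in for however the Commit Reader asks for
// the latest finalized block (LogPoller / HeadTracker in the real code).
type latestFinalizedBlockSource interface {
	LatestFinalizedBlock(ctx context.Context) (int64, error)
}

// logWithFinality is an illustrative log shape carrying the enrichment flag.
type logWithFinality struct {
	blockNumber int64
	finalized   bool
}

// enrichWithFinality queries the latest finalized block once per round and
// marks every log at or below it as finalized.
func enrichWithFinality(ctx context.Context, src latestFinalizedBlockSource, logs []logWithFinality) ([]logWithFinality, error) {
	latest, err := src.LatestFinalizedBlock(ctx)
	if err != nil {
		return nil, err
	}
	for i := range logs {
		logs[i].finalized = logs[i].blockNumber <= latest
	}
	return logs, nil
}
```

The single latest-finalized-block lookup per round is the part the text above notes could be cached in LogPoller/HeadTracker if it ever became a bottleneck.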

Load tests will be conducted to confirm the performance assumptions.
Please see the code and tests for more details.

chainlink-common PR
smartcontractkit/chainlink-common#650

### High level picture 


![image](https://github.com/smartcontractkit/ccip/assets/18304098/5c7e04e7-9703-4228-ac10-d264ab6e184a)


### Alternative solutions

[I've tried implementing a solution in which the cache is simplified and keeps
only executed/snoozed roots](#1170). Snoozed roots are passed to the database
query and used to filter out executed and finalized roots. However, it only
works up to a certain number of snoozed roots: after adding more than 250
snoozed roots to the LogPoller query, the driver/database started to cut off
some of the filters, making it unreliable (erroring or not filtering
properly). There is also a risk of hitting the Postgres query size limit
(because every root has to be passed in SQL), which could lead to a limbo
state the plugin can't recover from.


## Tests

### Before - fetching all CommitReports always

<img width="1293" alt="image"
src="https://github.com/user-attachments/assets/ea32eccb-8d76-468c-9018-01d4989968f7">


### After - fetching only unfinalized CommitReports and caching
finalized ones

<img width="1283" alt="image"
src="https://github.com/user-attachments/assets/3732f438-063e-4e8f-86a8-849684e88970">


### Summary

Caching roots improved the performance; the query is smaller in terms of the
objects scanned and returned. The number of roots returned is almost constant
during the entire test (around 5 reports at a time).
Slightly higher memory footprint (around 800 MB higher) but lower CPU
utilization during tests (~15% drop).

<img width="2500" alt="image"
src="https://github.com/user-attachments/assets/027cb379-03ec-419c-a289-7b1df3e25cd8">
asoliman92 pushed a commit that referenced this pull request Jul 31, 2024
@0xnogo 0xnogo mentioned this pull request Jul 31, 2024