
Conversation

@daywalker90

No description provided.

@cdecker
Collaborator

cdecker commented Nov 11, 2025

Interesting idea. I don't see any filtering yet as far as I can tell, though, so I'm not sure whether this filters on the plugin side (no gain there) or in lightningd, where it would make a lot of sense to skip useless JSON-RPC hook calls that aren't of interest.

@daywalker90
Author

With this PR it filters on lightningd's side.

@daywalker90
Author

But I wanted your review of my cln-plugin commit 43a63fc, so rusty can add it to the PR.

@cdecker
Collaborator

cdecker commented Nov 12, 2025

Gotcha, that way it looks very good, since it implements caller-filtering.
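For context, a minimal sketch of the caller-side filtering idea in plain C (the struct and field names here are illustrative, not lightningd's actual code): before serializing and sending a hook call, lightningd consults the filter list the plugin supplied at registration and skips the JSON-RPC round-trip entirely when the value (e.g. a custommsg type or an rpc_command method) is not of interest.

```
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct hook_subscriber {
	const char **filters;	/* NULL means "no filters given: always call" */
	size_t num_filters;
};

static bool subscriber_wants(const struct hook_subscriber *sub, const char *value)
{
	if (!sub->filters)
		return true;
	for (size_t i = 0; i < sub->num_filters; i++)
		if (strcmp(sub->filters[i], value) == 0)
			return true;
	/* Filtered out: no hook call is serialized or sent to this plugin. */
	return false;
}
```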

@rustyrussell rustyrussell force-pushed the guilt/bkpr-post-migration branch 11 times, most recently from 68d3dc3 to deae748 on November 20, 2025 01:37
I noticed this in the logs:

```
listinvoices: description/bolt11/bolt12 not found (

{"jsonrpc":"2)
```

And we make the same formatting mistake in several places.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
bc4bb2b "libplugin: use jsonrpc_io logic for sync requests too."
changed this message, and the test was not updated.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Nobody has hit this yet, but we're about to with our tests.

The size of the db is going to be whatever the total size of the tables is; bigger nodes,
bigger db.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
… create index.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This reverts the `bookkeeper: only read listchannelmoves 1000 entries at a time.` commit
so we can properly fix the scalability in the coming patches.

tests/test_coinmoves.py::test_generate_coinmoves (100,000):
	Time (from start to end of l2 node):	207 seconds
	Worst latency:				106 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
… at once."

This reverts commit 1dda0c0 so we can test
what it's like to be flooded with logs again.

This benefits from other improvements we've made this release to plugin input
handling (i.e. converting to use common/jsonrpc_io), so it doesn't
make much difference.

tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
	Time (from start to end of l2 node):	211 seconds
	Worst latency:				108 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We start with 100,000 entries.  We will scale this to 2M as we fix the
O(N^2) bottlenecks.

I measure the node time after we modify the db, like so:

	while guilt push && rm -rf /tmp/ltests* && uv run make -s RUST=0; do RUST=0 VALGRIND=0 TIMEOUT=100 TEST_DEBUG=1 eatmydata uv run pytest -vvv -p no:logging tests/test_coinmoves.py::test_generate_coinmoves > /tmp/`guilt top`-sql 2>&1; done

Then analyzed the results with:
	FILE=/tmp/synthetic-data.patch-sql; START=$(grep 'lightningd-2 .* Server started with public key' $FILE | tail -n1 | cut -d\  -f2 | cut -d. -f1); END=$(grep 'lightningd-2 .* JSON-RPC shutdown' $FILE | tail -n1 | cut -d\  -f2 | cut -d. -f1); echo $(( $(date +%s -d $END) - $(date +%s -d $START) )); grep 'E       assert' $FILE;

tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
	Time (from start to end of l2 node):	85 seconds
	Worst latency:				75 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…rink it.

We make a copy, then attach a destructor to the hook in case that plugin exits, so we
can NULL it out in the local copy.  When we have 300,000 requests pending, this means
we have 300,000 destructors, which don't scale (it's a singly-linked list).

Simply NULL out (rather than shrink) the array in the `plugin_hook`.
Then we can keep using that.
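A plain-C sketch of that approach (names are illustrative, not the real `plugin_hook` code): the hook keeps one shared subscriber array, a dying plugin's slot is blanked out instead of removed, and in-flight calls simply skip NULL entries, so no per-request destructors are needed at all.

```
#include <stddef.h>

struct plugin;

struct plugin_hook {
	struct plugin **subscribers;	/* slots are NULLed, never removed mid-call */
	size_t num_subscribers;
};

/* Called once when a plugin dies: blank its slot everywhere it appears. */
static void hook_forget_plugin(struct plugin_hook *hook, const struct plugin *dead)
{
	for (size_t i = 0; i < hook->num_subscribers; i++)
		if (hook->subscribers[i] == dead)
			hook->subscribers[i] = NULL;
}

/* In-flight calls iterate the shared array directly, skipping NULLs. */
static struct plugin *next_subscriber(const struct plugin_hook *hook, size_t *i)
{
	while (*i < hook->num_subscribers) {
		struct plugin *p = hook->subscribers[(*i)++];
		if (p)
			return p;
	}
	return NULL;
}
```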

tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
	Time (from start to end of l2 node):	34 seconds **WAS 85**
	Worst latency:				24 seconds **WAS 75**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we add a new hook, not at the end, while hooks are getting called,
then iteration could be messed up (e.g. calling a plugin twice, or
skipping one).

The simplest thing is to defer updates until nobody is calling the
hook.  In theory this could livelock, in practice it won't.
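A sketch of the deferral, again with hypothetical names and array allocation elided: count calls in flight, queue additions while the count is non-zero, and apply them once the last call finishes.

```
#include <stddef.h>

struct plugin;

struct plugin_hook {
	struct plugin **subscribers;	/* list being iterated by callers */
	size_t num_subscribers;
	struct plugin **pending_adds;	/* queued while calls are in flight */
	size_t num_pending;
	size_t calls_in_flight;
};

static void hook_subscribe(struct plugin_hook *hook, struct plugin *p)
{
	/* Never touch the list while someone is iterating it. */
	if (hook->calls_in_flight)
		hook->pending_adds[hook->num_pending++] = p;
	else
		hook->subscribers[hook->num_subscribers++] = p;
}

static void hook_call_done(struct plugin_hook *hook)
{
	if (--hook->calls_in_flight != 0)
		return;
	/* Nobody is iterating any more: apply the deferred additions. */
	for (size_t i = 0; i < hook->num_pending; i++)
		hook->subscribers[hook->num_subscribers++] = hook->pending_adds[i];
	hook->num_pending = 0;
}
```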

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If livelock ever *does* become an issue, we will see it in the logs.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we have many commands, this is where we spend all our time, and it's
just for an old assertion.

tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
	Time (from start to end of l2 node):	13 seconds **WAS 34**
	Worst latency:				4.0 seconds **WAS 24**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…quests.

If we have USDT compiled in, scanning the array of spans becomes
prohibitive if we have really large numbers of requests.  In the
bookkeeper code, when catching up with 1.6M channel events, this
became clear in profiling.

Use a hash table instead.
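A self-contained sketch of the idea (not the actual tracing code): key active spans by their 64-bit request id in a small open-addressed hash table, so lookup is O(1) expected instead of a scan over every pending request. Deletion (which would need tombstones with linear probing) is omitted.

```
#include <stdint.h>
#include <stddef.h>

struct span {
	uint64_t id;
	/* ...tracing state... */
};

struct span_table {
	struct span **slots;	/* capacity is a power of two */
	size_t mask;		/* capacity - 1 */
};

static size_t span_bucket(const struct span_table *t, uint64_t id)
{
	/* Multiplicative mixing so sequential request ids spread out. */
	return (size_t)((id * 0x9E3779B97F4A7C15ULL) >> 32) & t->mask;
}

static void span_add(struct span_table *t, struct span *s)
{
	size_t i = span_bucket(t, s->id);
	while (t->slots[i])			/* linear probing */
		i = (i + 1) & t->mask;
	t->slots[i] = s;
}

static struct span *span_find(const struct span_table *t, uint64_t id)
{
	size_t i = span_bucket(t, id);
	while (t->slots[i]) {
		if (t->slots[i]->id == id)
			return t->slots[i];
		i = (i + 1) & t->mask;
	}
	return NULL;
}
```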

Before:
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
	Time (from start to end of l2 node):	269 seconds (vs 14 with HAVE_USDT=0)
	Worst latency:				4.0 seconds

After:
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
	Time (from start to end of l2 node):	14 seconds
	Worst latency:				4.3 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell and others added 28 commits November 20, 2025 16:30
…hey want.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…ss them.

xpay is relying on the destructor to send another request.  This means
that it doesn't actually submit the request until the *next time* we wake.

This has been in xpay from the start, but it is not noticeable until
xpay stops subscribing to every command on the rpc_command hook.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
	Time (from start to end of l2 node):	 135 seconds **WAS 227**
	Worst latency:				 12.1 seconds **WAS 62.4**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: pyln-client: optional filters can be given when hooks are registered (for supported hooks)
Changelog-Added: Plugins: "filters" can be specified on the `custommsg` hook to limit what message types the hook will be called for.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
… issues.

A client can do this by sending a large request, so this allows us to see what
happens if they do that, even though 1MB (2MB buffer) is more than we need.

This drives our performance through the floor: see next patch which gets
us back on track.

tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
	Time (from start to end of l2 node):	 271 seconds **WAS 135**
	Worst latency:				 105 seconds **WAS 12.1**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We would keep parsing if we were out of tokens, even if we had actually
finished one object!

These are comparisons against the "xpay: use filtering on rpc_command
so we only get called on "pay"." commit, not the disastrous previous one!

tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
	Time (from start to end of l2 node):	 126 seconds (was 135)
	Worst latency:				 5.1 seconds **WAS 12.1**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This rotates through fds explicitly, to avoid unfairness.
This doesn't really make a difference until we start using it.
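An illustration of the fairness idea in a plain poll()-based sketch (not ccan/io's actual implementation): start each servicing pass at a rotating offset so a busy fd early in the array cannot starve the later ones.

```
#include <poll.h>
#include <stddef.h>

static size_t next_start;	/* rotates across event-loop iterations */

static void service_ready(struct pollfd *fds, size_t num,
			  void (*handle)(struct pollfd *))
{
	for (size_t n = 0; n < num; n++) {
		struct pollfd *pfd = &fds[(next_start + n) % num];
		if (pfd->revents)
			handle(pfd);
	}
	if (num)
		next_start = (next_start + 1) % num;
}
```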

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that ccan/io rotates through callbacks, we can call io_always() to yield.

Though it doesn't fire on our benchmark, it's a good thing to do.
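A skeletal sketch of the yielding pattern, assuming ccan/io's io_always(conn, next, arg); the remaining helpers and the state struct are hypothetical. After one unit of work we hand control back to the event loop, which now rotates through the other connections before calling us again.

```
#include <ccan/io/io.h>
#include <stdbool.h>

struct state;
static bool work_remaining(const struct state *s);	/* hypothetical */
static void do_one_unit(struct state *s);		/* hypothetical */
static struct io_plan *start_reading(struct io_conn *conn, struct state *s);

static struct io_plan *do_more_work(struct io_conn *conn, struct state *s)
{
	if (!work_remaining(s))
		return start_reading(conn, s);		/* back to waiting on the fd */
	do_one_unit(s);
	/* Yield: the loop serves other connections, then calls us again. */
	return io_always(conn, do_more_work, s);
}
```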

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that ccan/io rotates through callbacks, we can call io_always() to
yield.

We're now fast enough that this doesn't have any effect on this test,
but it's still good to have.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…nmoves each time.

tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
	Time (from start to end of l2 node):	 102 seconds **WAS 126**
	Worst latency:				 4.5 seconds **WAS 5.1**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a reasonable number of commands now, and we *already* keep a
strmap for the usage strings.  So simply keep the usage and the command
in the map, and skip the array.
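A sketch of the shape of the change, assuming ccan/strmap's STRMAP_MEMBERS/strmap_add/strmap_get macros (the struct names here are illustrative): the map entry carries both the usage string and the command itself, so dispatch becomes a map lookup rather than a linear scan of an array.

```
#include <ccan/strmap/strmap.h>
#include <stddef.h>

struct json_command;			/* the command definition itself */

struct command_entry {
	const struct json_command *cmd;
	const char *usage;
};

struct command_map {
	STRMAP_MEMBERS(struct command_entry *);
};

static void register_command(struct command_map *map, const char *name,
			      struct command_entry *entry)
{
	strmap_add(map, name, entry);	/* no separate array to append to */
}

static const struct json_command *find_command(struct command_map *map,
					       const char *name)
{
	const struct command_entry *e = strmap_get(map, name);
	return e ? e->cmd : NULL;	/* was: scan every registered command */
}
```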

tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
	Time (from start to end of l2 node):	 95 seconds (was 102)
	Worst latency:				 4.5 seconds

tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, Postgres):
	Time (from start to end of l2 node):	 231 seconds
	Worst latency:				 4.8 seconds

For comparison, the corresponding values for 25.09.2:

	sqlite3:
	Time (from start to end of l2 node):	 403 seconds

	Postgres:
	Time (from start to end of l2 node):	 671 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we've found all the other issues, the latency spike (4 seconds on my laptop)
for querying 2M elements remains.

Restore the limited sampling which we reverted, but make it 10,000 now.

This doesn't help our worst-case latency, because sql still asks for all 2M entries on
first access.  We address that next.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This avoids latency spikes when we ask lightningd to give us 2M entries.
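A control-flow sketch of the paging (BATCH, fetch_moves() and process_moves() are hypothetical stand-ins, not bookkeeper's real API, and indices are assumed contiguous): request at most BATCH entries starting from the last index seen and stop on a short batch, so no single request returns millions of rows.

```
#include <stddef.h>
#include <stdint.h>

#define BATCH 10000

struct move;	/* one coin movement entry */

/* Hypothetical helpers: fetch up to `limit` entries whose index is >= `start`,
 * returning how many were written, then consume them. */
size_t fetch_moves(uint64_t start, size_t limit, struct move **out);
void process_moves(struct move **moves, size_t n);

static void catch_up(uint64_t *next_index, struct move **buf)
{
	for (;;) {
		size_t n = fetch_moves(*next_index, BATCH, buf);
		process_moves(buf, n);
		*next_index += n;
		if (n < BATCH)		/* short batch: we've caught up */
			break;
	}
}
```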

tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
	Time (from start to end of l2 node):	 88 seconds (was 95)
	Worst latency:				 0.028 seconds **WAS 4.5**

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is slow, but will make sure we find out if we add latency spikes in future.

tests/test_coinmoves.py::test_generate_coinmoves (5,000,000, sqlite3):
	Time (from start to end of l2 node):	 223 seconds
	Latency min/median/max:			 0.0023 / 0.0033 / 0.113 seconds

tests/test_coinmoves.py::test_generate_coinmoves (5,000,000, Postgres):
	Time (from start to end of l2 node):	 470 seconds
	Latency min/median/max:			 0.0024 / 0.0098 / 0.124 seconds

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: lightningd: multiple significant speedups for large nodes, especially preventing "freezes" under exceptionally high load.
This lets us measure the improvement (if any) from not actually creating empty transactions.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We always start a transaction before processing, but there are cases where
we don't need to.  Switch to doing it on-demand.

This doesn't make a big difference for sqlite3, but it can for Postgres because
of the latency: 12% or so.  Every bit helps!
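A sketch with hypothetical names (not lightningd's actual db layer): open the transaction lazily on the first statement that needs one, and commit only if one was actually opened, so a no-op event costs no BEGIN/COMMIT round-trips at all.

```
#include <stdbool.h>

struct db {
	bool in_transaction;
};

void db_exec(struct db *db, const char *sql);	/* hypothetical */

static void db_begin_if_needed(struct db *db)
{
	if (db->in_transaction)
		return;
	db_exec(db, "BEGIN");
	db->in_transaction = true;
}

static void db_maybe_commit(struct db *db)
{
	if (!db->in_transaction)
		return;		/* nothing ran: no BEGIN was issued, so no COMMIT */
	db_exec(db, "COMMIT");
	db->in_transaction = false;
}
```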

30 runs, min-max(mean+/-stddev):

	Postgres before:  8.842773-9.769030(9.19531+/-0.21)
	Postgres after: 8.007967-8.321856(8.14172+/-0.066)

	sqlite3 before: 7.486042-8.371831(8.15544+/-0.19)
	sqlite3 after: 7.973411-8.576135(8.3025+/-0.12)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I had forgotten this file existed, but it needs tqdm and pytest-benchmark, so add those dev
requirements.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…nd line_graph helpers.

We want to use log-level info for benchmarking, for example.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Drop the log level, don't do extra memory checks, don't dump I/O.  These settings are not
realistic for non-development nodes.

Here's the comparison, using:
	VALGRIND=0 eatmydata uv run pytest -v --benchmark-compare=0001_baseline tests/benchmark.py

Name (time in us)                                       Min                       Max                      Mean                  StdDev                    Median                     IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_invoice (NOW)                                 414.9430 (1.0)         39,115.6150 (12.35)          834.7296 (1.0)        2,274.1198 (6.59)           611.7745 (1.0)          162.0230 (1.0)          1;33  1,197.9927 (1.0)         290           1
test_invoice (0001_baselin)                        951.9740 (2.29)         3,166.4061 (1.0)          1,366.7944 (1.64)         345.1460 (1.0)          1,328.6110 (2.17)         339.3517 (2.09)        48;15    731.6389 (0.61)        221           1

test_pay (NOW)                                  36,339.2329 (87.58)       69,477.8530 (21.94)       51,719.9459 (61.96)      8,033.4262 (23.28)       52,639.5601 (86.04)      9,590.1425 (59.19)         6;0     19.3349 (0.02)         17           1
test_pay (0001_baselin)                         61,741.5591 (148.80)     108,801.6961 (34.36)       88,284.6752 (105.76)    15,875.4417 (46.00)       96,006.0760 (156.93)    27,500.9771 (169.74)        6;0     11.3270 (0.01)         13           1

test_single_payment (NOW)                       46,721.4010 (112.60)      66,027.6250 (20.85)       56,699.4597 (67.93)      5,829.7234 (16.89)       54,659.9385 (89.35)      9,810.9820 (60.55)         6;0     17.6369 (0.01)         14           1
test_single_payment (0001_baselin)              52,215.3670 (125.84)     109,608.0400 (34.62)       74,521.8032 (89.28)     16,175.6833 (46.87)       72,881.5976 (119.13)    17,668.8581 (109.05)        4;1     13.4189 (0.01)         12           1

test_forward_payment (NOW)                     108,338.2401 (261.09)     115,570.7800 (36.50)      111,353.7021 (133.40)     2,483.2338 (7.19)       111,981.6790 (183.04)     3,360.6182 (20.74)         3;0      8.9804 (0.01)         11           1
test_forward_payment (0001_baselin)            108,917.7490 (262.49)     168,348.2911 (53.17)      140,321.5990 (168.10)    22,375.2216 (64.83)      143,746.4900 (234.97)    36,363.4459 (224.43)        3;0      7.1265 (0.01)          7           1

test_start (NOW)                               299,278.4000 (721.25)     330,340.2610 (104.33)     314,121.8292 (376.32)    11,385.4700 (32.99)      314,603.4899 (514.25)    13,876.4871 (85.65)         2;0      3.1835 (0.00)          5           1
test_start (0001_baselin)                      305,928.9111 (737.28)     575,270.0820 (181.68)     419,496.8460 (502.55)   138,248.1937 (400.55)     334,207.0500 (546.29)   254,339.0035 (>1000.0)       2;0      2.3838 (0.00)          5           1

test_long_forward_payment (NOW)              1,088,077.8680 (>1000.0)  1,131,035.0260 (357.20)   1,108,896.7970 (>1000.0)   20,494.1195 (59.38)    1,098,544.8329 (>1000.0)   36,904.4899 (227.77)        3;0      0.9018 (0.00)          5           1
test_long_forward_payment (0001_baselin)     1,282,326.5721 (>1000.0)  1,450,350.8301 (458.04)   1,369,618.5776 (>1000.0)   73,432.8716 (212.76)   1,380,547.3910 (>1000.0)  132,647.3573 (818.69)        2;0      0.7301 (0.00)          5           1

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…in.py::test_spam_commands to benchmark.py

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This hammers lightningd with `listinvoices` commands.

	$ VALGRIND=0 TEST_DB_PROVIDER=postgres eatmydata uv run pytest -v tests/benchmark.py::test_spam_listcommands

sqlite3:

   test_spam_listcommands     2.1193  2.4524  2.2343  0.1341  2.2229  0.1709       1;0  0.4476       5           1

PostgreSQL:

   test_spam_listcommands     6.5572  6.8440  6.7067  0.1032  6.6967  0.1063       2;0  0.1491       5           1

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we are the only writer, we don't need one.

Name (time in s)                               Min     Max    Mean  StdDev  Median
sqlite: test_spam_listcommands  (before)    2.1193  2.4524  2.2343  0.1341  2.2229
sqlite: test_spam_listcommands  (after)     2.0140  2.2349  2.1001  0.0893  2.0644
Postgres: test_spam_listcommands (before)   6.5572  6.8440  6.7067  0.1032  6.6967
Postgres: test_spam_listcommands (after)    4.4237  5.0024  4.6495  0.2278  4.6717

A nice 31% speedup!

Changelog-Changed: Postgres: significant speedup on read-only operations (e.g. 30% on empty SELECTs)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We are seeing this in the CI logs, e.g. in tests/test_connection.py::test_reconnect_sender_add1:

   lightningd-1 2025-11-17T05:48:00.665Z DEBUG   jsonrpc#84: Pausing parsing after 1 requests

followed by:

   lightningd-1 2025-11-17T05:48:02.068Z **BROKEN** 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: wake delay for WIRE_CHANNEL_REESTABLISH: 8512msec

So, what is consuming lightningd for 8 or so seconds?

This message helped diagnose that the issue was dev-memleak: fixed in a different branch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>