
Refactor events subsystem #7000

Merged: magik6k merged 9 commits into master from feat/refactor-events on Aug 31, 2021
Conversation

Stebalien (Member) commented Aug 6, 2021

This builds on #6974.

Changes

  1. Pass contexts everywhere instead of storing them.
  2. Handle "catch up" if the chain notify channel is closed.
  3. Unify events around a single "observer" system. To simplify things, observers are passed both a to and a from tipset instead of just a "target" tipset. On reverts, this "target" tipset was actually the from tipset (the tipset being reverted), which was confusing. A sketch of this interface follows the list.
  4. Improve the cache:
    1. Validate inputs. Given the "catch up" handling, this may not be necessary.
    2. Expose the normal lotus API. This removes the temptation to use the cache as a source of truth.
    3. Ensure that we're always able to call through the cache when necessary (instead of just failing when something isn't in the cache).
    4. Use this caching API everywhere, instead of explicitly using the cache.
  5. Improve the height system.
    1. Simplify handler management.
    2. Implement handler GC.
  6. Improve the hcEvents system.
    1. Correctly handle reverts (I think?). Previously, a revert from A to B, then an apply from B to C, would check for changes from A to C, not changes from B to C (it would store the last tipset as A instead of B).
    2. Store the tipsets instead of the heights for queued events (faster, less error cases, etc.).
  7. Propagate errors when constructing the events system.
  8. Handle pure reverts. Technically, a user can call SetHead to revert to a parent without applying any new blocks. Before, we would have restarted and likely would have corrupted the event state.
  9. Optimized message handling (i.e., avoided re-hashing messages).
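
To make change 3 concrete, here is a minimal sketch of what the unified observer contract could look like. The names (TipSetObserver, Apply, Revert) are illustrative assumptions, not necessarily the identifiers used in the patch:

package events

import (
	"context"

	"github.com/filecoin-project/lotus/chain/types"
)

// TipSetObserver is a sketch of the unified observer contract: the
// context is passed in rather than stored (change 1), and both
// endpoints of the transition are explicit, so a revert handler no
// longer has to guess which side a lone "target" tipset referred to.
type TipSetObserver interface {
	// Apply is called when the head advances from `from` to `to`.
	Apply(ctx context.Context, from, to *types.TipSet) error
	// Revert is called when `from` is rolled back, leaving `to` as head.
	Revert(ctx context.Context, from, to *types.TipSet) error
}

With both tipsets in hand, apply and revert become symmetric: a handler can diff from against to in either direction without extra lookups.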

Architecture

The architecture isn't significantly different from the current one.

  1. Observer: subscribes to head changes and calls apply/revert on event subsystems.
  2. Cache: caches messages (LRU), plus tipsets and the head of the current chain (by subscribing to the observer).
  3. Event Subsystems: Hook into the observer to react to chain changes.
[Lotus Node]
   ^     | (head change notifications)
   |     |
  cache<-| -------------------+
  ^ ^    |                    |
  | |    v                    |
  |observer                   |
  |   | (apply/revert events) |
  |  / \______________________+
  | v
[Event Subsystems] # height events, message events, state events, etc...

The main problem with this architecture is that it's entirely synchronous (callbacks). I'd prefer to use channels, but that's a larger refactor and I assume there are assumptions being made about everything happening in lock-step.
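
As a rough sketch of that lock-step dispatch (the observer type, its o.observers and o.head fields, and the logging are assumptions for illustration, not the patch's actual code):

// headChange fans a head change out to every subsystem with plain
// method calls, so all handlers run synchronously and in lock-step.
// Reverts are processed before applies, walking the chain contiguously
// from the old head to the new one.
func (o *observer) headChange(ctx context.Context, rev, app []*types.TipSet) error {
	head := o.head
	for _, from := range rev {
		to, err := o.api.ChainGetTipSet(ctx, from.Parents())
		if err != nil {
			return err
		}
		for _, obs := range o.observers {
			if err := obs.Revert(ctx, from, to); err != nil {
				log.Errorf("revert handler failed: %s", err)
			}
		}
		head = to
	}
	for _, to := range app {
		for _, obs := range o.observers {
			if err := obs.Apply(ctx, head, to); err != nil {
				log.Errorf("apply handler failed: %s", err)
			}
		}
		head = to
	}
	o.head = head
	return nil
}

Switching this to channels would decouple the subsystems, but, as noted above, existing callers likely assume that by the time a head-change notification returns, every handler has already observed it.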

Future

Future improvements:

  • Reduce locking further.
  • Simplify the "ChainAt" function. The lock dance we're playing there is likely overkill.
  • We could probably re-implement much of hcEvents in-terms of more generalized events (e.g., we might be able to use the height event system).
  • Propagate contexts through StateChanged
  • Maybe remove the cache? We only need it for looking up "lookback" tipsets on timeout in the hcEvents system. Everywhere else, we just remember the tipsets we need for later as we get them.
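
For reference, the "call through the cache" behavior from change 4 amounts to a read-through wrapper over the node API; here is a minimal sketch with assumed names (the cache type, blockMsgCache, and c.api are illustrative):

// ChainGetBlockMessages serves from the LRU cache when possible and
// falls back to the real node API otherwise, so callers can treat the
// cache as the normal lotus API rather than as a separate source of
// truth that may simply be missing entries.
func (c *cache) ChainGetBlockMessages(ctx context.Context, blk cid.Cid) (*api.BlockMessages, error) {
	c.lk.Lock()
	defer c.lk.Unlock()

	if msgs, ok := c.blockMsgCache.Get(blk); ok {
		return msgs.(*api.BlockMessages), nil
	}
	msgs, err := c.api.ChainGetBlockMessages(ctx, blk)
	if err != nil {
		return nil, err
	}
	c.blockMsgCache.Add(blk, msgs)
	return msgs, nil
}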

TODO (before merge)

Improve test coverage (at least test the fixed bugs/edge cases).

chain/events/events_test.go (review thread, resolved)
Stebalien marked this pull request as ready for review on August 11, 2021 06:56
Stebalien requested a review from a team as a code owner on August 11, 2021 06:56
Stebalien (Member, Author)

> Simplify the "ChainAt" function. The lock dance we're playing there is likely overkill.

I'm happy to take a crack at this now if deemed necessary.

for i, tsb := range ts.Cids() {
	msgs, err := me.cs.ChainGetBlockMessages(context.TODO(), tsb)
	...
	me.blockMsgCache.Add(tsb, msgsI)
}
Contributor

This gets messages block by block, but in fact many of those messages are never executed, or are duplicates. Could we use ChainGetMessagesInTipset here?

Stebalien (Member, Author)

That's a good point, but something for a followup patch. I'm replicating the current behavior here.

Stebalien (Member, Author)

Hm. Actually, I think this code is correct as-is. We want to know if a message was included in the tipset, even if it wasn't executed. That way we can check to see if the message successfully applied. Otherwise, we'll keep waiting for the message to show up even though it never will.

Any context here @magik6k?

Contributor

The ideal/intended behavior is:

  • Repriced messages should be applied (otherwise repricing messages to get them unstuck will only make things worse).
  • If a different message with the same nonce was executed after the desired confidence, we should tell the API consumer about that.
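
For illustration, detecting the second case might look like the sketch below; the helper is hypothetical, not part of this patch:

// matchesOrReplaces reports whether a message in an applied tipset is
// the one we are waiting for, or a different message from the same
// sender with the same nonce (e.g. a repriced replacement). In the
// replacement case, the API consumer should be notified instead of
// being left waiting for a message that can no longer execute.
func matchesOrReplaces(want *types.Message, applied []*types.Message) (executed, replaced bool) {
	for _, m := range applied {
		if m.From != want.From || m.Nonce != want.Nonce {
			continue
		}
		if m.Cid() == want.Cid() {
			return true, false // exact match: our message executed
		}
		return false, true // same nonce, different CID: it was replaced
	}
	return false, false // not in this tipset; keep waiting
}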

Stebalien (Member, Author)

I'm a bit out of my depth here. Does anything need to be changed?

magik6k (Contributor) commented Aug 31, 2021

I think this is fine for now, as it doesn't seem to change the current behavior.

vyzo (Contributor) left a comment

First pass, this looks sane so far.

I'll have to do a second pass to focus on the details, as this is a rather complex patch.

chain/events/observer.go (review thread, resolved)
Stebalien force-pushed the feat/refactor-events branch 2 times, most recently from bd3f313 to aedba70, on August 20, 2021 22:07.
codecov bot commented Aug 20, 2021

Codecov Report

Merging #7000 (1cf556c) into master (d1a68df) will decrease coverage by 0.00%.
The diff coverage is 76.51%.

❗ Current head 1cf556c differs from pull request most recent head 1da59fa. Consider uploading reports for the commit 1da59fa to get more accurate results

@@            Coverage Diff             @@
##           master    #7000      +/-   ##
==========================================
- Coverage   39.05%   39.05%   -0.01%     
==========================================
  Files         607      610       +3     
  Lines       64625    64716      +91     
==========================================
+ Hits        25242    25274      +32     
- Misses      35000    35044      +44     
- Partials     4383     4398      +15     
Impacted Files Coverage Δ
build/params_mainnet.go 71.42% <ø> (ø)
build/params_shared_vals.go 71.42% <ø> (ø)
chain/events/utils.go 0.00% <0.00%> (ø)
cmd/lotus/daemon.go 0.00% <ø> (ø)
extern/sector-storage/ffiwrapper/prover_cgo.go 100.00% <ø> (ø)
extern/sector-storage/ffiwrapper/sealer_cgo.go 60.91% <ø> (ø)
extern/sector-storage/ffiwrapper/verifier_cgo.go 74.66% <ø> (ø)
gateway/node.go 48.43% <0.00%> (-0.51%) ⬇️
lib/ulimit/ulimit_unix.go 100.00% <ø> (ø)
tools/stats/rpc.go 0.00% <0.00%> (ø)
... and 34 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d1a68df...1da59fa.

magik6k (Contributor) left a comment

Looks good; definitely better than the current code, which needs this cleanup.

The only thing that needs to be done here, other than resolving conflicts, is adding ChainGetPath to lotus-gateway; otherwise we'll break lotus-lite.

(also some nits / notes on changed behavior, but I'm fine with those)
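
For what it's worth, a gateway pass-through for that can be quite small; in this sketch the checkTipsetKey guard and the target field are assumptions about the gateway's internals:

// ChainGetPath validates both endpoints against the gateway's lookback
// limits, then forwards the call to the underlying full node, keeping
// the events subsystem usable from lotus-lite.
func (gw *Node) ChainGetPath(ctx context.Context, from, to types.TipSetKey) ([]*api.HeadChange, error) {
	if err := gw.checkTipsetKey(ctx, from); err != nil {
		return nil, err
	}
	if err := gw.checkTipsetKey(ctx, to); err != nil {
		return nil, err
	}
	return gw.target.ChainGetPath(ctx, from, to)
}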

itests/api_test.go (review thread, resolved)
chain/events/events_height.go (review thread, resolved)
confidence: confidence,
...
// Stash the tipset for future triggers.
for _, handler := range tipsets {
	handler.ts = to
}
Contributor

This will make the final call with a tipset that's potentially later than what the previous code would do (not AtOrAfter the specified height), but given how those tipsets are used, that's likely not an issue.

Stebalien (Member, Author)

You sure about that? This will stash the first non-null tipset after the target height. If there's a reorg, we'll set this to nil then re-set it to the new tipset.
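
As a sketch of that stash/clear cycle (handler fields and function names here are illustrative, not the patch's actual code):

// On apply, stash the first non-null tipset at or after the trigger
// height; the confidence-delayed call fires with it later. On revert,
// clear the stash so the next apply re-stashes a tipset from the new fork.
func stashOnApply(to *types.TipSet, tipsets []*heightHandler) {
	for _, handler := range tipsets {
		handler.ts = to
	}
}

func clearOnRevert(tipsets []*heightHandler) {
	for _, handler := range tipsets {
		handler.ts = nil
	}
}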

chain/events/events_test.go (review thread, resolved)
magik6k merged commit b0f57d7 into master on Aug 31, 2021
magik6k deleted the feat/refactor-events branch on August 31, 2021 10:02