
Refactor events subsystem #7000

Merged: magik6k merged 9 commits into master from feat/refactor-events on Aug 31, 2021
Conversation

Stebalien (Member) commented Aug 6, 2021

This builds on #6974.

Changes

  1. Pass contexts everywhere instead of storing them.
  2. Handle "catch up" if the chain notify channel is closed.
  3. Unify events around a single "observer" system. To simplify things, observers are passed both a to and a from tipset instead of just a "target" tipset. On reverts, this "target" tipset was actually the from tipset (the tipset being reverted), which was confusing. A sketch of this interface follows the list.
  4. Improve the cache:
    1. Validate inputs. Given the "catch up" handling, this may not be necessary.
    2. Expose the normal lotus API. This removes the temptation to use the cache as a source of truth.
    3. Ensure that we're always able to call through the cache when necessary (instead of just failing when something isn't in the cache).
    4. Use this caching API everywhere, instead of explicitly using the cache.
  5. Improve the height system.
    1. Simplify handler management.
    2. Implement handler GC.
  6. Improve the hcEvents system.
    1. Correctly handle reverts (I think?). Previously, a revert from A to B, then an apply from B to C, would check for changes from A to C, not changes from B to C (it would store the last tipset as A instead of B).
    2. Store the tipsets instead of the heights for queued events (faster, less error cases, etc.).
  7. Propagate errors when constructing the events system.
  8. Handle pure reverts. Technically, a user can call SetHead to revert to a parent without applying any new blocks. Before, we would have restarted and likely would have corrupted the event state.
  9. Optimized message handling (i.e., avoided re-hashing messages).
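
To make change 3 concrete, here is a minimal sketch of what the unified observer contract could look like. The names (TipSetObserver, Apply, Revert) are illustrative assumptions, not necessarily the identifiers used in the patch:

package events

import (
	"context"

	"github.com/filecoin-project/lotus/chain/types"
)

// TipSetObserver is a sketch of the unified observer contract: the
// context is passed in rather than stored (change 1), and both
// endpoints of the transition are explicit, so a revert handler no
// longer has to guess which side a lone "target" tipset referred to.
type TipSetObserver interface {
	// Apply is called when the head advances from `from` to `to`.
	Apply(ctx context.Context, from, to *types.TipSet) error
	// Revert is called when `from` is rolled back, leaving `to` as head.
	Revert(ctx context.Context, from, to *types.TipSet) error
}

With both tipsets in hand, apply and revert become symmetric: a handler can diff from against to in either direction without extra lookups.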

Architecture

The architecture isn't significantly different from the current one.

  1. Observer: subscribes to head changes and calls apply/revert on event subsystems.
  2. Cache: caches messages (LRU), plus tipsets and the head of the current chain (by subscribing to the observer).
  3. Event Subsystems: Hook into the observer to react to chain changes.
[Lotus Node]
   ^     | (head change notifications)
   |     |
  cache<-| -------------------+
  ^ ^    |                    |
  | |    v                    |
  |observer                   |
  |   | (apply/revert events) |
  |  / \______________________+
  | v
[Event Subsystems] # height events, message events, state events, etc...

The main problem with this architecture is that it's entirely synchronous (callbacks). I'd prefer to use channels, but that's a larger refactor and I assume there are assumptions being made about everything happening in lock-step.
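
As a rough sketch of that lock-step dispatch (the observer type, its o.observers and o.head fields, and the logging are assumptions for illustration, not the patch's actual code):

// headChange fans a head change out to every subsystem with plain
// method calls, so all handlers run synchronously and in lock-step.
// Reverts are processed before applies, walking the chain contiguously
// from the old head to the new one.
func (o *observer) headChange(ctx context.Context, rev, app []*types.TipSet) error {
	head := o.head
	for _, from := range rev {
		to, err := o.api.ChainGetTipSet(ctx, from.Parents())
		if err != nil {
			return err
		}
		for _, obs := range o.observers {
			if err := obs.Revert(ctx, from, to); err != nil {
				log.Errorf("revert handler failed: %s", err)
			}
		}
		head = to
	}
	for _, to := range app {
		for _, obs := range o.observers {
			if err := obs.Apply(ctx, head, to); err != nil {
				log.Errorf("apply handler failed: %s", err)
			}
		}
		head = to
	}
	o.head = head
	return nil
}

Switching this to channels would decouple the subsystems, but, as noted above, existing callers likely assume that by the time a head-change notification returns, every handler has already observed it.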

Future

Future improvements:

  • Reduce locking further.
  • Simplify the "ChainAt" function. The lock dance we're playing there is likely overkill.
  • We could probably re-implement much of hcEvents in-terms of more generalized events (e.g., we might be able to use the height event system).
  • Propagate contexts through StateChanged
  • Maybe remove the cache? We only need it for looking up "lookback" tipsets on timeout in the hcEvents system. Everywhere else, we just remember the tipsets we need for later as we get them.
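
For reference, the "call through the cache" behavior from change 4 amounts to a read-through wrapper over the node API; here is a minimal sketch with assumed names (the cache type, blockMsgCache, and c.api are illustrative):

// ChainGetBlockMessages serves from the LRU cache when possible and
// falls back to the real node API otherwise, so callers can treat the
// cache as the normal lotus API rather than as a separate source of
// truth that may simply be missing entries.
func (c *cache) ChainGetBlockMessages(ctx context.Context, blk cid.Cid) (*api.BlockMessages, error) {
	c.lk.Lock()
	defer c.lk.Unlock()

	if msgs, ok := c.blockMsgCache.Get(blk); ok {
		return msgs.(*api.BlockMessages), nil
	}
	msgs, err := c.api.ChainGetBlockMessages(ctx, blk)
	if err != nil {
		return nil, err
	}
	c.blockMsgCache.Add(blk, msgs)
	return msgs, nil
}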

TODO (before merge)

Improve test coverage (at least test the fixed bugs/edge cases).

chain/events/events_test.go (review thread, resolved)
Stebalien marked this pull request as ready for review on August 11, 2021 06:56
Stebalien requested a review from a team as a code owner on August 11, 2021 06:56
Stebalien (Member, Author)

> Simplify the "ChainAt" function. The lock dance we're playing there is likely overkill.

I'm happy to take a crack at this now if deemed necessary.

for i, tsb := range ts.Cids() {
	msgs, err := me.cs.ChainGetBlockMessages(context.TODO(), tsb)
	...
	me.blockMsgCache.Add(tsb, msgsI)
}
Contributor

This gets messages block by block, but in fact many of those messages are never executed, or are duplicates. Could we use ChainGetMessagesInTipset here?

Stebalien (Member, Author)

That's a good point, but something for a followup patch. I'm replicating the current behavior here.

Stebalien (Member, Author)

Hm. Actually, I think this code is correct as-is. We want to know if a message was included in the tipset, even if it wasn't executed. That way we can check to see if the message successfully applied. Otherwise, we'll keep waiting for the message to show up even though it never will.

Any context here @magik6k?

Contributor

The ideal/intended behavior is:

  • Repriced messages should be applied (otherwise repricing messages to get them unstuck will only make things worse).
  • If a different message with the same nonce was executed after the desired confidence, we should tell the API consumer about that.
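
For illustration, detecting the second case might look like the sketch below; the helper is hypothetical, not part of this patch:

// matchesOrReplaces reports whether a message in an applied tipset is
// the one we are waiting for, or a different message from the same
// sender with the same nonce (e.g. a repriced replacement). In the
// replacement case, the API consumer should be notified instead of
// being left waiting for a message that can no longer execute.
func matchesOrReplaces(want *types.Message, applied []*types.Message) (executed, replaced bool) {
	for _, m := range applied {
		if m.From != want.From || m.Nonce != want.Nonce {
			continue
		}
		if m.Cid() == want.Cid() {
			return true, false // exact match: our message executed
		}
		return false, true // same nonce, different CID: it was replaced
	}
	return false, false // not in this tipset; keep waiting
}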

Stebalien (Member, Author)

I'm a bit out of my depth here. Does anything need to be changed?

magik6k (Contributor) commented Aug 31, 2021

I think this is fine for now, as it doesn't seem to change the current behavior.

vyzo (Contributor) left a comment

First pass, this looks sane so far.

I'll have to do a second pass to focus on the details, as this is a rather complex patch.

chain/events/observer.go (review thread, resolved)
Stebalien force-pushed the feat/refactor-events branch 2 times, most recently from bd3f313 to aedba70, on August 20, 2021 22:07.
codecov bot commented Aug 20, 2021

Codecov Report

Merging #7000 (1cf556c) into master (d1a68df) will decrease coverage by 0.00%.
The diff coverage is 76.51%.

❗ Current head 1cf556c differs from pull request most recent head 1da59fa. Consider uploading reports for the commit 1da59fa to get more accurate results

@@            Coverage Diff             @@
##           master    #7000      +/-   ##
==========================================
- Coverage   39.05%   39.05%   -0.01%     
==========================================
  Files         607      610       +3     
  Lines       64625    64716      +91     
==========================================
+ Hits        25242    25274      +32     
- Misses      35000    35044      +44     
- Partials     4383     4398      +15     
Impacted Files Coverage Δ
build/params_mainnet.go 71.42% <ø> (ø)
build/params_shared_vals.go 71.42% <ø> (ø)
chain/events/utils.go 0.00% <0.00%> (ø)
cmd/lotus/daemon.go 0.00% <ø> (ø)
extern/sector-storage/ffiwrapper/prover_cgo.go 100.00% <ø> (ø)
extern/sector-storage/ffiwrapper/sealer_cgo.go 60.91% <ø> (ø)
extern/sector-storage/ffiwrapper/verifier_cgo.go 74.66% <ø> (ø)
gateway/node.go 48.43% <0.00%> (-0.51%) ⬇️
lib/ulimit/ulimit_unix.go 100.00% <ø> (ø)
tools/stats/rpc.go 0.00% <0.00%> (ø)
... and 34 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d1a68df...1da59fa.

magik6k (Contributor) left a comment

Looks good; definitely better than the current code, which needs this cleanup.

The only thing that needs to be done here, other than resolving conflicts, is adding ChainGetPath to lotus-gateway; otherwise we'll break lotus-lite.

(also some nits / notes on changed behavior, but I'm fine with those)
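
For what it's worth, a gateway pass-through for that can be quite small; in this sketch the checkTipsetKey guard and the target field are assumptions about the gateway's internals:

// ChainGetPath validates both endpoints against the gateway's lookback
// limits, then forwards the call to the underlying full node, keeping
// the events subsystem usable from lotus-lite.
func (gw *Node) ChainGetPath(ctx context.Context, from, to types.TipSetKey) ([]*api.HeadChange, error) {
	if err := gw.checkTipsetKey(ctx, from); err != nil {
		return nil, err
	}
	if err := gw.checkTipsetKey(ctx, to); err != nil {
		return nil, err
	}
	return gw.target.ChainGetPath(ctx, from, to)
}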

itests/api_test.go (review thread, resolved)
chain/events/events_height.go (review thread, resolved)
confidence: confidence,
...
// Stash the tipset for future triggers.
for _, handler := range tipsets {
	handler.ts = to
}
Contributor

This will make the final call with a tipset that's potentially later than what the previous code would do (not AtOrAfter the specified height), but given how those tipsets are used, that's likely not an issue.

Stebalien (Member, Author)

You sure about that? This will stash the first non-null tipset after the target height. If there's a reorg, we'll set this to nil then re-set it to the new tipset.
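
As a sketch of that stash/clear cycle (handler fields and function names here are illustrative, not the patch's actual code):

// On apply, stash the first non-null tipset at or after the trigger
// height; the confidence-delayed call fires with it later. On revert,
// clear the stash so the next apply re-stashes a tipset from the new fork.
func stashOnApply(to *types.TipSet, tipsets []*heightHandler) {
	for _, handler := range tipsets {
		handler.ts = to
	}
}

func clearOnRevert(tipsets []*heightHandler) {
	for _, handler := range tipsets {
		handler.ts = nil
	}
}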

chain/events/events_test.go (review thread, resolved)
magik6k merged commit b0f57d7 into master on Aug 31, 2021
magik6k deleted the feat/refactor-events branch on August 31, 2021 10:02