Refactor consensus engine handling as well as clique internals #1899
Conversation
```diff
@@ -59,27 +58,28 @@ class SnapshotManager:
     """

     logger = get_extended_debug_logger('eth.consensus.clique.snapshot_manager.SnapshotManager')
     # TODO: How do we get around this? It is way too slow without cached snapshots.
     _snapshots = lru.LRU(IN_MEMORY_SNAPSHOTS)
```
@carver I've hit another road block here with the stateless ConsensusAPI concept. I'm sorry I didn't notice this earlier. Here it goes:

So, we were able to get rid of the HeaderCache because it could be replaced by going from validate_seal(header) to validate_seal_extension(header, parents). I should also note that other clients don't necessarily have a header cache in their Clique implementation and that this was just me trying to make things work with just validate_seal, which only takes a single header.

But the snapshot cache is something different. Every Clique implementation out there has this (afaik) and I believe it is at the very heart of Clique, so I'm not sure how we would get around it. It basically means: if we don't cache snapshots, then for every single header that we validate, we have to load all previous headers from the database until we hit the most recent checkpoint header (every 30,000 headers for Görli). As you can imagine, this is super, super slow.

As a band-aid fix I moved this to the class level, but that obviously isn't an acceptable solution. I can't help but think we have to move at least parts of this back to the chain level, unless you have other clever ideas that I might not be seeing.
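To make the cost concrete, here is a toy sketch (all names are hypothetical illustrations, not the actual py-evm code): without a cached snapshot, every validated header triggers a walk back through parent headers to the nearest checkpoint, while a cache short-circuits that walk after the first hit.

```python
from functools import lru_cache

EPOCH_LENGTH = 30000  # Görli has a checkpoint header every 30,000 blocks


def headers_to_load_without_cache(block_number):
    # Hypothetical stand-in for walking parent headers back through
    # the database until the most recent checkpoint header is reached.
    checkpoint = (block_number // EPOCH_LENGTH) * EPOCH_LENGTH
    return block_number - checkpoint


@lru_cache(maxsize=128)
def snapshot_at(block_number):
    # With a cache, the expensive walk happens at most once per block
    # number; without it, every validated header repeats the full walk.
    return headers_to_load_without_cache(block_number)
```

In the worst case (a header just below a checkpoint) the uncached walk touches nearly 30,000 headers, which is why every Clique implementation keeps this cache.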
I want to note that if you feel like playing with this (you can run it with ethereum/trinity#1196 as simply as trinity --goerli --disable-tx-pool), then I'm happy if you take this in any direction that gets us closer to a merge. I'm not trying to offload this to you, just saying that if you have ideas, you should not feel obligated to discuss them first if you think you could just take a stab at it and finish it up. I basically just want to see this wrapped up and out of my way 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. Silly question: what if we persisted every snapshot?
An alternative that comes to mind is having the chain pass in a volatile data store. Much like how it passes in the DB, but with the expectation that everything in the volatile store will be dropped on shutdown. (Also, that the VMs need to manage cross-fork incompatibility across the volatile store themselves, like they need to do with the DB already)
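A bare-bones sketch of such a volatile store (VolatileStore is a hypothetical name; nothing here is the actual py-evm API): the chain would pass it to VMs alongside the persistent DB, with no durability guarantees.

```python
class VolatileStore:
    # Hypothetical in-memory key/value store that the chain would pass
    # to VMs alongside the persistent DB; its contents carry across VM
    # runs but are dropped entirely on shutdown.
    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __contains__(self, key):
        return key in self._data

    def clear(self):
        # e.g. could be called when crossing a fork boundary
        self._data.clear()
```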
> what if we persisted every snapshot?

Multiple problems, I think:
- Lots of overhead in snapshot creation
- Still one db lookup per block to get the snapshot
- Lots of wasted disk space that we need to clean up again

> Much like how it passes in the DB, but with the expectation that everything in the volatile store will be dropped on shutdown.

Yes, dropping the data on shutdown won't be a problem; I think even dropping it between forks wouldn't be a problem at all. It would just be one header after the fork that causes us to suck in 30,000 headers from the db until we are back to normal.

So, yeah, that would work, but I think it cannot be just some predefined storage, because e.g. in this case we want to set up an LRU cache for the snapshots, and in other cases we might need some different kind of storage or eviction policy.
But actually, this brought me to an idea and I took a stab at it and it seems to work. Here's a rough outline, and then I'll continue with some inline comments.
Instead of the ConsensusAPI taking in the db, it will take in a ConsensusContextAPI.
```python
class ConsensusAPI(ABC):

    @abstractmethod
    def __init__(self, context: ConsensusContextAPI) -> None:
        """
        Initialize the consensus api.
        """
        ...
```
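For illustration, here is a minimal sketch of how a concrete engine could keep per-chain state on such a context instead of at class level (ConsensusContextAPI is reduced to a bare db wrapper here, and CliqueLikeConsensus is a made-up name, not the actual Clique implementation):

```python
from abc import ABC, abstractmethod


class ConsensusContextAPI:
    # Hypothetical minimal context: it just wraps a db handle that the
    # chain hands over once at initialization.
    def __init__(self, db):
        self.db = db


class ConsensusAPI(ABC):
    @abstractmethod
    def __init__(self, context: ConsensusContextAPI) -> None:
        ...


class CliqueLikeConsensus(ConsensusAPI):
    # Per-chain state (e.g. a snapshot cache) lives on the instance
    # built from the context, not in a class-level attribute.
    def __init__(self, context: ConsensusContextAPI) -> None:
        self.db = context.db
        self.snapshots = {}  # a real engine would use an LRU here
```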
We set the consensus_context_class on the chain. And just like the ChainDB, the consensus_context is set once when the chain is initialized.
```python
class Chain(BaseChain):
    ...
    chaindb_class: Type[ChainDatabaseAPI] = ChainDB
    consensus_context_class: Type[ConsensusContextAPI] = ConsensusContext

    def __init__(self, base_db: AtomicDatabaseAPI) -> None:
        ...
        self.chaindb = self.get_chaindb_class()(base_db)
        self.consensus_context = self.consensus_context_class(self.chaindb.db)
        self.headerdb = HeaderDB(base_db)
```
Then later, in get_vm(at_header), that context is being used.
```python
def get_vm(self, at_header: BlockHeaderAPI = None) -> VirtualMachineAPI:
    header = self.ensure_header(at_header)
    vm_class = self.get_vm_class_for_block_number(header.block_number)
    chain_context = ChainContext(self.chain_id)
    return vm_class(
        header=header,
        chaindb=self.chaindb,
        chain_context=chain_context,
        consensus_context=self.consensus_context,
    )
```
So, the bottom line is:
- we continue to have VM instances that operate only on the state they get passed in
- we have consensus context state that is per-chain rather than per-VM
- we do not have any state leaking (as in: two different chain instances accidentally sharing consensus state)

There's a lot to clean up here, but you can take a look at the most recent commit, which implements this, and also another one on the Trinity side that takes it for a ride. It works and I think it kind of gives us the best of both worlds.
I have a good feeling with that, hoping you feel the same vibe 😅
If you are 👍 on that direction, then I can clean up the mess and propose a clean PR.
Yeah, this feels like a good direction 👍 🎉
I also like the idea that maybe we can enforce the reset of the context across fork boundaries. For example, we might create an independent consensus_context for each distinct VM, so in get_vm(), instead of just getting self.consensus_context, we get self.get_consensus_context(vm_class), which is generated on demand, but with only one per VM.
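A sketch of what that on-demand variant could look like (hypothetical names and a plain dict standing in for the real context class; the PR as written keeps a single per-chain context instead):

```python
class Chain:
    # Stand-in for a real ConsensusContext class; a plain dict is
    # enough to illustrate the one-context-per-VM-class bookkeeping.
    consensus_context_class = dict

    def __init__(self, base_db):
        self.base_db = base_db
        self._consensus_contexts = {}

    def get_consensus_context(self, vm_class):
        # Lazily create exactly one context per VM class, which resets
        # consensus state across fork boundaries automatically.
        if vm_class not in self._consensus_contexts:
            self._consensus_contexts[vm_class] = self.consensus_context_class()
        return self._consensus_contexts[vm_class]
```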
PoW could use this too! There is an epoch cache that it doesn't want to regenerate every time, and I think it does something hacky to save them (like keep a global cache 🙈 ). Obviously that change doesn't need to go in here, but a tracking issue would be cool.
> PoW could use this too! There is an epoch cache that it doesn't want to regenerate every time, and I think it does something hacky to save them
Ah yes! We can use that to clean this up :)
> I also like the idea that maybe we can enforce the reset of the context across fork boundaries
Yeah, we could do that if we need it, but I think we should have a good reason to. Like, in the Clique case it would just cause us to take more time to validate the header after each fork transition while not providing any actual benefit. But sure, there could be other consensus schemes that would need a different context per fork.
Ok, I'm gonna start getting this into shape now :)
Tracking issue: #1900
@carver This can be given a new review now. For a second I thought we could actually roll back to having
```python
@abstractmethod
def __init__(self, db: AtomicDatabaseAPI) -> None:
    """
    Initialize the context with a database.
```
Couple random thoughts, as I won't be able to pick up the review again until later:
I was assuming we would formalize the idea that this context can't write to the database (maybe this should be a read-only version that's supplied?). Or, if we don't, then maybe it fits as a kind of parallel to ChainDB, like ConsensusDB.
"remains static" in the docs above didn't immediately mean to me what I think we want to say. Something like: this instance stays in memory across VM runs.
> I was assuming we would formalize the idea that this context can't write to the database
But, we need to write to the database. E.g. Clique persists snapshots to the db.
```python
db[key] = encode_snapshot(snapshot)
```
> Or if we don't then maybe it fits as a kind of parallel to ChainDB, like ConsensusDB

I'm not sure how that would work out. In the ChainDB case, we know the specific types that make up a chain and that can be stored and retrieved. But for consensus, it seems pretty much up to the specific consensus algorithm which kinds of things need persistence. Clique uses snapshots, but it's hard to imagine all the different things other consensus schemes would require.
> "remains static" in the docs above didn't immediately mean to me what I think we want to say. Something like: this instance stays in memory across VM runs.
👍 I'll update that.
@carver just a reminder that this still needs to be reviewed.
I did a full passthrough on everything but test_clique_consensus.py, which I kind of skimmed. There are a couple of places where I wish we had a better solution. Like, validate_header's check_seal=True is the only reason we need to make it an instance method. Changing that requires a lot of churn, and isn't a clear enough win for me to recommend right now, and I don't have any suggestions... So whenever you're satisfied, it's time to 🚢 !
```python
    """
    Validate the seal on the given header by checking the proof of work.
    """
    check_pow(
```
(If there isn't one yet) Could you add a to-do issue for moving the pow cache into a consensus context?
Yes, we have that here #1900
```python
        chain_class_without_seal_validation.vm_configuration  # type: ignore
    ),
)
return chain_class.configure(vm_configuration=no_pow_vms)
```
I love how simple/readable this got with the new arch.
What was wrong?
This is another attempt to refactor Clique based on #1875 and what @carver and I discussed offline.
There's also a Trinity PR that can sync Görli with this: ethereum/trinity#1196
How was it fixed?

- validate_seal and validate_header are now instance methods. The only reason they can be classmethods today is that our PoW implementation relies on a globally shared cache, which should be refactored as described in Use ConsensusContextAPI to get rid of global cache currently used in PoW implementation. #1900
- There are two new methods: chain.validate_chain_extension(header, parents) and vm.validate_seal_extension. They perform extension seal checks to support consensus schemes where headers cannot be checked if parents are missing.
- The consensus mechanism is now abstracted via ConsensusAPI and ConsensusContextAPI. VMs instantiate a consensus api based on the set consensus_class and pass it a context which they receive from the chain upon instantiation. The chain instantiates the consensus context api based on the consensus_context_class.
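As a loose illustration of the extension-check idea (a hypothetical, heavily simplified function; the real API lives on the chain and VM classes and works on actual header objects), a header's seal may only be verifiable once its parents are available:

```python
def validate_seal_extension(header, parents):
    # Hypothetical scheme in which a header's seal can only be verified
    # once its parents are known (as with signer rotation in Clique);
    # headers are modeled as plain dicts purely for illustration.
    if not parents:
        raise ValueError("cannot check the seal without parent headers")
    if header["parent_hash"] != parents[-1]["hash"]:
        raise ValueError("header does not extend the given parents")
    return True
```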
To-Do
Cute Animal Picture