Two (Radical?) Proposals for Clarity 3: Performance and Versioning #3772

kantai · 2023-06-29T19:25:35Z

kantai
Jun 29, 2023
Maintainer

I'd like to propose two major changes to the Clarity VM with the introduction of Clarity 3. The first of these is focused on performance improvements to the VM. The second of these, which is probably a necessary step toward the first, is focused on backwards-compatibility and safety.

Improving VM Performance

Fundamentally, the Clarity VM is an eval/apply mutually recursive interpreter. This is a very easy kind of interpreter to build and maintain. However, there are generally performance consequences to this kind of interpreter: the mutual recursion means that the VM isn't very stack friendly, there are costs associated with all of the necessary dispatching, and the interaction of recursion and the borrow checker means that the interfaces to eval/apply require a fair amount of cloning. In addition, VM execution exhibits very little spatial locality because each AST node could reside in a different heap-allocated chunk. There's a good reason few production interpreters are implemented this way: it's slow.
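As a rough illustration of the eval/apply shape described above, here is a minimal sketch in Python. It is not the Clarity VM: the mini-language (integers, variable names, and `[function, arg, ...]` lists) and the names `eval_expr`, `apply_fn`, and `BUILTINS` are all assumptions made up for this example.

```python
import operator

# Hypothetical mini-language: an expression is an int, a variable name (str),
# or a list of the form [function_name, arg1, arg2, ...].
BUILTINS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_expr(expr, env):
    """Evaluate one AST node; function calls recurse into apply_fn."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env[expr]
    name, *args = expr
    return apply_fn(name, args, env)

def apply_fn(name, args, env):
    """Evaluate each argument (recursing into eval_expr), then call the builtin."""
    values = [eval_expr(arg, env) for arg in args]
    return BUILTINS[name](*values)

# (+ 1 (* x 3)) with x = 4
result = eval_expr(["+", 1, ["*", "x", 3]], {"x": 4})
```

Every nested expression adds two host stack frames (one `eval_expr`, one `apply_fn`), and each node is a separate heap object, which is the kind of overhead the paragraph above refers to.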

To improve on Clarity's runtime, I'm proposing moving to bytecode interpretation. To be clear, contract publish transactions will continue to publish source code, and the source code would itself still be thought of as the consensus critical component: just as a bug today in the Clarity VM is a "consensus bug", so too would a bug in the bytecode compilation (and any bug in the subsequent bytecode execution).

This is not necessarily as dramatic a change as it sounds. The bytecode change could be as simple as a linearization of the AST, with the implementation impacting mostly the "special" Clarity functions (e.g., fold, filter, contract-call?) and, of course, the eval loop. In any event, the static analysis passes, which operate on the AST, could be largely unchanged.
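One way to picture "a linearization of the AST" is a post-order walk that emits a flat instruction list, which a simple loop then executes with an explicit value stack. The sketch below assumes a toy expression language with binary operators only; the opcode names (`PUSH`, `LOAD`, `CALL`) and the functions `compile_expr` and `run` are hypothetical and not part of any proposed Clarity bytecode.

```python
import operator

BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def compile_expr(expr, code=None):
    """Post-order walk of the AST: emit both operands, then the operator."""
    if code is None:
        code = []
    if isinstance(expr, int):
        code.append(("PUSH", expr))
    elif isinstance(expr, str):
        code.append(("LOAD", expr))
    else:
        op, left, right = expr
        compile_expr(left, code)
        compile_expr(right, code)
        code.append(("CALL", op))
    return code

def run(code, env):
    """Execute the flat instruction list with an explicit value stack."""
    stack = []
    for opcode, operand in code:
        if opcode == "PUSH":
            stack.append(operand)
        elif opcode == "LOAD":
            stack.append(env[operand])
        elif opcode == "CALL":
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[operand](left, right))
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()

# (+ 1 (* x 3)) becomes five instructions stored contiguously in one list.
program = compile_expr(["+", 1, ["*", "x", 3]])
value = run(program, {"x": 4})
```

The compile step still walks the AST once, so analysis passes that operate on the AST would be unaffected in this picture; only execution changes from recursion to a loop over a contiguous list.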

However, this would also present a very good inflection point for deciding on targeting an existing bytecode language. Production of WASM or EVM compatible bytecode would present a clear pathway toward compatibility support for other blockchains.

I fully realize that this proposal opens up a big change in the Clarity VM, but I think it's worthwhile to engage with it. Testing moderately complicated contracts such as https://github.com/hirosystems/stacks-pyth-bridge in Clarinet (where any MARF runtime overhead is eliminated, because it does not use a MARF) reveals function calls that take up to 800-900ms on modern hardware. While this may not be the major bottleneck right now (that is still MARF speed), it is likely to be the next bottleneck, and if the Stacks blockchain is going to support anywhere near the speeds of other blockchains, the Clarity VM will need to be at least an order of magnitude faster. The obvious path to that is bytecode execution.

Black-Box Versioning

The Clarity 1 to Clarity 2 upgrade occurred in Stacks 2.1. Stacks 2.1 continues to support both Clarity versions. The implementation of this upgrade involves the same Clarity VM executing both Clarity 1 and Clarity 2 contracts. This is a somewhat risky implementation strategy, because the Clarity 2 changes must be made in such a way that existing Clarity 1 contract behavior does not change. While this is risky, the risk in Clarity 2 was somewhat contained by the limited scope of the Clarity 2 changes. The type system remains largely the same, with some tweaks to trait invocation (which did create a consensus bug in Stacks 2.2, necessitating a fix in Stacks 2.3); beyond that, the changes were limited to new functions and variables.

If more changes are proposed, however, the risk of inadvertently breaking backwards compatibility, or of introducing new bugs due to the esoteric implementation requirements of a single VM handling 3 different contract versions, increases dramatically. For example, if the type system were to expand with new 256-bit integer types (or smaller integer types like 32-bit or 64-bit), the impact on the codebase could be quite damaging: now we'd need to manage two incompatible type hierarchies, and check on every function call whether or not the current value matches the expected type hierarchy of the current contract version.

The alternative to this is a kind of black-box versioning. In a perfect world, the way this would work is that the blockstack_lib output in the stacks-blockchain repo would include two different clarity dependencies (e.g., "2.0.0" and "3.0.0"). The transaction handler would invoke the correct library depending on the version in a contract publish transaction, or based on the stored contract's version. There are some difficulties here. For example, contract-call? invocations would need to invoke a different VM. However, these difficulties would make the compatibility code explicit, and the boundaries between Clarity versions would therefore be well defined.
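The dispatch idea above can be sketched as a small router that picks a VM by the stored contract's version and hands each VM a callback for cross-contract calls. This is only an illustration under stated assumptions: `Contract`, `StubVM`, `Dispatcher`, the `calls` table, and the contract and function names are all hypothetical, and a real implementation would link separately versioned Clarity libraries rather than one stub class.

```python
from dataclasses import dataclass, field

@dataclass
class Contract:
    name: str
    clarity_version: int
    # Hypothetical stand-in for contract code:
    # function name -> (callee contract, callee function)
    calls: dict = field(default_factory=dict)

class StubVM:
    """Stand-in for one frozen Clarity library; real VMs would differ per version."""
    def __init__(self, version):
        self.version = version

    def execute(self, contract, function, args, call_contract):
        trace = [f"clarity{self.version}:{contract.name}.{function}"]
        # Model a contract-call?: the callee may belong to a different VM, so
        # the call is routed back through the dispatcher, not handled locally.
        callee = contract.calls.get(function)
        if callee is not None:
            trace += call_contract(callee[0], callee[1], args)
        return trace

class Dispatcher:
    def __init__(self, vms, contracts):
        self.vms = vms              # Clarity version -> VM
        self.contracts = contracts  # contract name -> Contract

    def call(self, contract_name, function, args):
        contract = self.contracts[contract_name]
        vm = self.vms[contract.clarity_version]
        return vm.execute(contract, function, args, self.call)

contracts = {
    "oracle": Contract("oracle", 2),
    "bridge": Contract("bridge", 3, calls={"relay": ("oracle", "get-price")}),
}
dispatcher = Dispatcher({2: StubVM(2), 3: StubVM(3)}, contracts)
trace = dispatcher.call("bridge", "relay", [])
```

Here `trace` records that the Clarity 3 contract ran in the version 3 VM and its nested call ran in the version 2 VM. The only place where versions meet is `Dispatcher.call`, which is where any compatibility code would have to live.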

igorsyl · 2023-06-29T20:06:38Z

igorsyl
Jun 29, 2023

Improving VM Performance
I am highly supportive of this. We can learn a lot from the ICP blockchain.
Could we define a decidable subset of WASM opcodes for the Clarity Virtual Machine and then perhaps modify clang to be able to compile a decidable subset of JavaScript into the decidable subset of WASM?

Black-Box Versioning
+1. This is the classical tradeoff between stuffing everything into a base class vs copying files and making small changes. No need to over-engineer for reusability given Clarity 1 will no longer change. We can snapshot the code of older Clarity versions and freeze it in the codebase. The problem of needing to use different VMs across contract calls is akin to inter-process communication, which could be solved by defining a VM-agnostic message-passing protocol.

0 replies

jcnelson · 2023-06-30T04:35:49Z

jcnelson
Jun 30, 2023
Maintainer

Thanks for writing this up @kantai! A few thoughts below.

reveals function calls that take up to 800-900ms on modern hardware.

I just want to double-check this benchmark. It takes nearly a second to do a function call in the Clarity VM? As in, just calling a function in the same contract? Are there confounding factors in Clarinet that could be causing this, or are we certain that this is solely due to work done within the VM itself? Can you explain in more detail as to how this performance metric was obtained (i.e. where did you start/stop measuring, and what function(s) did you test)?

While this may not be the major bottleneck right now (that is still MARF speed)

I'm very skeptical that the MARF is the bottleneck for VM performance. I'll have more data next week with reproducible experiments, but some preliminary gatherings I made tonight suggest that MARF key/value-hash lookups are pretty fast (on the order of 10ms or lower).

What I suspect is a major cause of slowness in the VM's storage layer is the Clarity sqlite DB, for a few reasons:

The goodput is pretty bad in the analysis DB. We load the whole contract analysis to do any query against a contract, for example.
No work has been done to tweak the DB's page size to avoid internal fragmentation.
The B+-tree in the Clarity DB grows with the total number of storage operations across all forks, which incurs a performance hit on each read. If we instead interleaved the Clarity DB data (i.e. the data whose hash is tracked by the MARF) within the MARF itself, i.e. right next to the MARFValue, then this penalty could be eliminated altogether.

so too would a bug in the bytecode compilation (and any bug in the subsequent bytecode execution).

What this effectively means is that the bytecode spec and interpreter implementation are consensus-critical code paths as well. I point this out because an alternative (but naive and brittle) approach could be to support multiple bytecode back-ends from a consensus-critical Clarity IR as a means of supporting EVM, WASM, and other execution environments. I just want to make sure everyone reading this understands that this is not going to be supported, and should not be supported, because there is exactly one correct sequence of state-transitions in the VM's memory and disk state for any piece of Clarity code. Making sure that multiple bytecode back-ends do this would be extremely difficult.

But that said, ...

However, this would also present a very good inflection point for deciding on targeting an existing bytecode language. Production of WASM or EVM compatible bytecode would present a clear pathway toward compatibility support for other blockchains.

Do we even care about EVM compatibility? We don't want TC smart contracts in the first place (so there's little gain in compatibility with existing EVM-compiled code or EVM-compiled languages), and the EVM isn't particularly well-designed to leverage real hardware (so it's not the performance win it could be). The former concern applies to WASM as well.

Also, the bytecode VM (whatever it happens to be) will be somewhat tightly integrated with the Clarity DB and burnchain DB, due to the fact that we have many Clarity built-ins that load data about the blockchain itself. I'm not sure how much using an off-the-shelf VM helps us here, since the integration work will likely be substantial.

Black-Box Versioning

I think this is great in principle, but the devil will be in the details. We'll need to think long and hard about how contract-call?s between Clarity 3 and Clarity 1/2 will work, especially since a Clarity 1/2 trait can be potentially implemented by a Clarity 3 contract (meaning, Clarity 1/2 code can call Clarity 3 code in addition to Clarity 3 code being able to call Clarity 1/2 code).

4 replies

kantai Jun 30, 2023
Maintainer Author

I just want to double-check this benchmark. It takes nearly a second to do a function call in the Clarity VM? As in, just calling a function in the same contract?

No -- this is the amount of time it takes to run the whole contract-call invoked by clarinet test in the linked repo. The contract is doing a bunch of stuff, but it's not a ridiculously complicated contract (it's an implementation of a price oracle standard). If you look at a flamegraph of the execution (which I can upload a little later), the VM is spending time all over the place, and there isn't really a clear runtime culprit (though the VM does spend something like 10-20% of its time doing memory management, so clone() is a likely culprit there -- this is something that could possibly be avoided in recursive walks, but ultimately not easily).

I'm very skeptical that the MARF is the bottleneck for VM performance. I'll have more data next week with reproducible experiments, but some preliminary gatherings I made tonight suggest that MARF key/value-hash lookups are pretty fast (on the order of 10ms or lower).

I should have been more precise. It's not the bottleneck for VM performance exactly, rather it's still generally the cost limit that blocks reach first: the MARF read limit is 15K. At your benchmark numbers of 10ms, that implies a full block would spend ~2.5 minutes validating MARF reads, which seems like an appropriate amount of time for the limit. However, if blocks are going to contain more transactions, that means that the read limit must be higher. In order for the read limit to be higher, those operations must be faster than 10ms.
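The arithmetic behind that estimate, as a small sketch. The 15K read limit and the 10ms per-read figure come from the discussion above; the function names and the choice of integer microseconds are assumptions made for this example.

```python
def total_read_time_us(read_limit, read_latency_us):
    """Worst-case time spent on MARF reads if a block uses its whole read limit."""
    return read_limit * read_latency_us

def max_read_limit(budget_us, read_latency_us):
    """Largest read limit that fits in a fixed time budget at a given latency."""
    return budget_us // read_latency_us

READ_LIMIT = 15_000          # current MARF read limit (15K)
LATENCY_US = 10_000          # 10ms per read, expressed in microseconds

budget_us = total_read_time_us(READ_LIMIT, LATENCY_US)
budget_minutes = budget_us / 60_000_000   # 150 seconds, i.e. 2.5 minutes

# Holding the time budget fixed, the limit scales inversely with latency:
# halving the per-read latency doubles the number of reads that fit.
doubled_limit = max_read_limit(budget_us, LATENCY_US // 2)
```

In other words, the read limit can only grow as fast as per-read latency shrinks, unless the time budget per block grows too.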

What this effectively means is that the bytecode spec and interpreter implementation are consensus-critical code paths as well. I point this out because an alternative (but naive and brittle) approach could be to support multiple bytecode back-ends from a consensus-critical Clarity IR as a means of supporting EVM, WASM, and other execution environments

Yes, I agree -- the bytecode interpreter would be a consensus-critical code path.

Do we even care about EVM compatibility?

I believe that there are ecosystem participants who definitely do, but I'm probably the wrong person to ask.

Also, the bytecode VM (whatever it happens to be) will be somewhat tightly integrated with the Clarity DB and burnchain DB, due to the fact that we have many Clarity built-ins that load data about the blockchain itself. I'm not sure how much using an off-the-shelf VM helps us here, since the integration work will likely be substantial.

That may be true, but even so, there's upside to using a standard bytecode and integrating around it: you get a lot of developer tooling much more easily (debuggers, etc.), and as people work on things like ZK rollups, those systems would also be much easier to support because, at a fundamental level, they would share a standard. Also, integrations with the Clarity DB and burnchain DB probably wouldn't be as substantial as you might be thinking here: these integrations should really just be function calls from the perspective of the interpreter, so ideally there wouldn't actually be very much "tight" integration.

I think this is great in principle, but the devil will be in the details. We'll need to think long and hard about how contract-call?s between Clarity 3 and Clarity 1/2 will work, especially since a Clarity 1/2 trait can be potentially implemented by a Clarity 3 contract (meaning, Clarity 1/2 code can call Clarity 3 code in addition to Clarity 3 code being able to call Clarity 1/2 code).

Yes, but this is already the case -- the interface between Clarity 1 and 2 led to the Stacks 2.2 bug because the type conversion logic only engaged when the epoch was exactly equal to Stacks 2.1. If the interface actually had two different Rust types, that could not have happened (because all Clarity 1 type inputs would always need to be converted). Forcing the version interface to be explicit makes it clear exactly what that interface will be. It will allow testing of Clarity 3 to treat Clarity 3 as an independent unit, and the interactions with prior versions can be tested solely through testing the interface code: Clarity 3 is guaranteed that its inputs are always validated with respect to its own type system.
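To make the contrast concrete, here is a toy sketch of the two patterns. It is not the actual Stacks code: `sanitize`, `shared_type_call`, `Clarity1Value`, `Clarity3Value`, and the other names are hypothetical, and the runtime `isinstance` checks stand in for what distinct Rust types would enforce at compile time.

```python
from dataclasses import dataclass

def sanitize(payload):
    """Placeholder for real cross-version type conversion (hypothetical)."""
    return ("sanitized", payload)

def shared_type_call(payload, epoch):
    # Fragile pattern: one value type for every version, with conversion
    # gated on an exact epoch match. Any later epoch silently skips it.
    if epoch == "2.1":
        payload = sanitize(payload)
    return payload

@dataclass(frozen=True)
class Clarity1Value:
    payload: object

@dataclass(frozen=True)
class Clarity3Value:
    payload: object

def clarity1_to_clarity3(value):
    """In this sketch, the only way to turn Clarity 1 data into a Clarity3Value."""
    if not isinstance(value, Clarity1Value):
        raise TypeError("expected a Clarity1Value")
    return Clarity3Value(sanitize(value.payload))

def clarity3_entry_point(value):
    # In Rust the compiler would reject a Clarity1Value here; this runtime
    # check stands in for that static guarantee.
    if not isinstance(value, Clarity3Value):
        raise TypeError("Clarity 3 code only accepts Clarity3Value inputs")
    return value.payload

skipped = shared_type_call(5, "2.2")   # conversion is skipped without any error
converted = clarity3_entry_point(clarity1_to_clarity3(Clarity1Value(5)))
```

With the shared type, forgetting the conversion produces no error. With distinct types, passing unconverted Clarity 1 data to `clarity3_entry_point` fails immediately, so the conversion cannot depend on an epoch check being written correctly.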

jcnelson Jul 3, 2023
Maintainer

I should have been more precise. It's not the bottleneck for VM performance exactly, rather it's still generally the cost limit that blocks reach first: the MARF read limit is 15K. At your benchmark numbers of 10ms, that implies a full block would spend ~2.5 minutes validating MARF reads, which seems like an appropriate amount of time for the limit. However, if blocks are going to contain more transactions, that means that the read limit must be higher. In order for the read limit to be higher, those operations must be faster than 10ms.

I have some good news on this front: MARF reads are actually much, much faster than this (on the order of 25 microseconds). See here: #3777

kantai Jul 6, 2023
Maintainer Author

Great news!

That makes Clarity VM speed improvements even more important. If the MARF read limit can be set to ~150K, that means the Clarity runtime will be the bottleneck.

jcnelson Jul 6, 2023
Maintainer

I'm currently working on a PR that will interleave MARF-indexed data directly into the MARF itself, alongside the MARFValue leaf data. Based on the findings in that thread, that could easily 2x the speed further (possibly more). Once I have that ready, I'll profile block-processing end-to-end on the mainnet chainstate to see where the hotspots are.
