Solana BPF v2 #20323
I have been thinking that instead of adding a few new instructions to BPF and removing some others, it might be better to go with a portable-code approach directly. We are breaking compatibility with BPF in a major way anyway, and we want to support other native architectures besides x86-64, such as ARMv8 AArch64 and maybe SPIR-V. These would require at least an entire data flow analysis and register allocation, because they are so different from x86-64 in their data-flow style and would not work as well as the current BPF to x86-64 conversion. Furthermore, we could think a lot more about the final conversion steps and e.g. do things which BPF is not meant for, like:
What does this mean in practical terms? What other changes to the BPF instruction set would you suggest (for now considering only supporting non-x86 ISAs)? Maybe some BPF instructions do not map directly to ARM instructions, but is it so radical that we need to redefine the BPF instruction set completely?
For the current BPF to x86-64 conversion it is really hard to do any optimizations, because most rely on static analysis that would not be necessary if that information had not been thrown away by the previous compilation step.

For ARMv8 AArch64: It has almost double the registers (31) that x86-64 (16) and BPF (11) have. Not wasting them requires a complete reallocation of the registers and a reorganization of the stack spilling, again requiring a data flow and control flow analysis beforehand. All of this could also be skipped if we did not compile down to a specific register set in the first place.

For SPIR-V: The entire architecture works completely differently. While you still program what is seemingly a single thread, it is executed in bundles of 32 or 64 threads in lockstep, so control-flow divergence has to be minimized. Also, there are many more registers (thousands) for cheap context switching, but almost no cache in turn. Data is best touched only once.
What I suggest is not to fork & modify BPF any further but use something else entirely.
We could change the BPF registers to an unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course. I'm not sure it's possible to come up with an instruction set that would map to all real machine instruction sets equally well. Java bytecode is not quite a good example, and optimizing it works well only for long-running server applications. SPIR-V also requires a different programming model. It seems one cannot take an arbitrary Rust program and compile it for execution on SPIR-V.
I agree this is indeed something to think about.
Definitely, while there have been some funny trials such as rust-gpu, we would really need to think the "more parallelism" thing through end to end, throughout the entire ecosystem, not just the VM and ISA.
You mean as a form of SSA, right?
That is the reason why I mentioned WASM. It has its shortcomings, but it works well over a wide range of hardware architectures.
Yes, this essentially echoes your suggestion of staying in SSA. By stack machine you mean we would redefine the BPF abstract machine to be a stack machine the same as the Java VM? I guess it's a possibility, but then we really need a good run-time binary translator (in rbpf) to optimize the code. I'm not sure why you're so concerned with instruction encoding -- is it to reduce the code size of on-chain programs?
Yes, I think the BPF 8-byte fixed instruction size is terribly wasteful. It might not be a concern for the Linux kernel; for us, however, program size is limited by account size.
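For reference, every eBPF instruction occupies a fixed 64-bit slot (opcode, two 4-bit register fields, a 16-bit offset, a 32-bit immediate), even when the opcode uses almost none of those fields. A minimal decoder sketch illustrating the layout (names are illustrative, not rbpf's):

```rust
/// One fixed-size eBPF instruction slot: 8 bytes regardless of
/// how many fields the opcode actually uses.
#[derive(Debug, PartialEq)]
struct Insn {
    opcode: u8,
    dst: u8, // low nibble of byte 1
    src: u8, // high nibble of byte 1
    offset: i16,
    imm: i32,
}

fn decode(slot: [u8; 8]) -> Insn {
    Insn {
        opcode: slot[0],
        dst: slot[1] & 0x0f,
        src: slot[1] >> 4,
        offset: i16::from_le_bytes([slot[2], slot[3]]),
        imm: i32::from_le_bytes([slot[4], slot[5], slot[6], slot[7]]),
    }
}

fn main() {
    // `exit` (opcode 0x95) carries no operands, wasting 7 of its 8 bytes.
    let insn = decode([0x95, 0, 0, 0, 0, 0, 0, 0]);
    assert_eq!(insn.opcode, 0x95);
    assert_eq!(insn.dst, 0);
    assert_eq!(insn.imm, 0);
}
```

An `exit` or register-to-register `mov` spends the same 8 bytes as a `lddw` half, which is where the size pressure on account-limited programs comes from.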
The queue (FIFO) variant might also do the trick and be more straightforward, as the instruction schedule can stay in the same order. Let's say we have a SSA form like this:

```
...
%3824 = add %3818, %3821
%3825 = add %3815, %3823
%3826 = mul %3824, %3825
```

To convert to the queue machine representation you simply:

```
...
add %5, %2
add %9, %1
mul %1, %0
```

Not only does it need no destination operands, but also the numeric value of the source operands is bounded.
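The renumbering in the example is just the distance between the consuming instruction and the producing instruction, minus one (a sketch, assuming one SSA value per instruction with consecutive ids):

```rust
/// In a queue machine encoding, a source operand names how many
/// results were enqueued between its producer and its consumer.
fn queue_operand(consumer: u64, producer: u64) -> u64 {
    consumer - producer - 1
}

fn main() {
    // %3824 = add %3818, %3821  ->  add %5, %2
    assert_eq!(queue_operand(3824, 3818), 5);
    assert_eq!(queue_operand(3824, 3821), 2);
    // %3825 = add %3815, %3823  ->  add %9, %1
    assert_eq!(queue_operand(3825, 3815), 9);
    assert_eq!(queue_operand(3825, 3823), 1);
    // %3826 = mul %3824, %3825  ->  mul %1, %0
    assert_eq!(queue_operand(3826, 3824), 1);
    assert_eq!(queue_operand(3826, 3825), 0);
}
```

Since operands are bounded by how far back a value is produced, they stay small in practice, which is what makes a compact variable-length encoding attractive.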
There is room for both here; I think we should formally move away from "BPF" anyway. In the shorter term, the proposed changes described in this issue's description can provide benefits as well as pave the way for a more flexible evolution path for other SBF formats. Aka, SBFvX doesn't have to be anything like SBFv1 or 2.
@dmakarov How long do you think the last 4 items in the issue description would take to implement? |
The support for signed division I think we already have enabled in the LLVM backend (I'll have to double check). I started implementing the variable stack size and I believe the remaining work is in RBPF only. The same as now, the programs do not manage the stack pointer, but RBPF does; I want to make RBPF manage both the frame pointer and the stack pointer. On the compiler side there will be only one additional instruction generated in function prologue and epilogue, which communicates the stack size to RBPF. RBPF at run-time updates its internal data structures managing the frame and stack pointers, without exposing this to the executing program. Both frame and stack pointers are needed to know the boundaries of the variable stack frame at all times. I estimate I would need another week to finish this work (but probably not this current week).
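One way to picture the VM-side bookkeeping described above (a sketch of the idea, not the actual rbpf data structures): the prologue instruction reports the frame size, and the VM moves both pointers without the program ever touching them:

```rust
/// Illustrative VM-side stack management: the guest program never
/// manipulates these; it only reports its frame size on entry.
struct VmStack {
    stack_pointer: u64,
    frames: Vec<u64>, // saved frame boundaries, one per active call
}

impl VmStack {
    /// Function prologue: the frame pointer becomes the old stack
    /// pointer, then `frame_size` bytes are carved out (stack grows down).
    fn enter(&mut self, frame_size: u64) {
        self.frames.push(self.stack_pointer);
        self.stack_pointer -= frame_size;
    }

    /// Function epilogue: restore the caller's stack pointer.
    fn leave(&mut self) {
        self.stack_pointer = self.frames.pop().expect("unbalanced call");
    }

    /// The current frame spans [stack_pointer, frame_pointer).
    fn frame_pointer(&self) -> u64 {
        *self.frames.last().expect("no active frame")
    }
}

fn main() {
    let mut s = VmStack { stack_pointer: 0x1000, frames: Vec::new() };
    s.enter(64);
    assert_eq!(s.frame_pointer(), 0x1000);
    assert_eq!(s.stack_pointer, 0x1000 - 64);
    s.enter(32); // nested call with a different frame size
    assert_eq!(s.stack_pointer, 0x1000 - 96);
    s.leave();
    s.leave();
    assert_eq!(s.stack_pointer, 0x1000);
}
```

Tracking both pointers is exactly what lets frames vary in size: the boundary of the current frame can no longer be derived from a fixed per-frame constant.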
The two other items, input buffers and the callx restriction, I haven't looked at yet. I need to understand better what it would take. Restricting callx to known targets is probably simple. What to do if the target is not known? Issue an error? Removing the input buffer instructions also does not take much time, but what should replace these instructions? Are they not used at all? Then why is it necessary to remove them? Maybe we can discuss this in more detail?
Removing those instructions cleans up the instruction set, turns them into invalid instructions, and removes the burden on the runtime to support these instructions. |
Register allocation is expensive to do, and that's one of the reasons why WASM isn't so great. I think we should tread carefully there. The other way round, from BPF registers to virtual registers/SSA form, is not so expensive: a single pass can create SSA form.
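Within a single basic block the renaming really is one pass: track the latest SSA version of each BPF register, rewrite sources through that map, and hand every destination a fresh name. A minimal sketch under that assumption (joins across control flow would additionally need phi nodes):

```rust
use std::collections::HashMap;

/// Instructions are stand-in (dst_reg, src_reg) pairs for BPF ALU ops.
/// Returns, per instruction, (fresh_dst_version, src_version_read).
fn rename_block(insns: &[(u8, u8)]) -> Vec<(u32, u32)> {
    let mut current: HashMap<u8, u32> = HashMap::new(); // reg -> latest SSA version
    let mut next = 0u32;
    let mut out = Vec::new();
    for &(dst, src) in insns {
        // Read the current version of the source (fresh name if live-in).
        let s = *current.entry(src).or_insert_with(|| {
            let v = next;
            next += 1;
            v
        });
        // The destination always gets a fresh version.
        let d = next;
        next += 1;
        current.insert(dst, d);
        out.push((d, s));
    }
    out
}

fn main() {
    // r1 = f(r2); r1 = f(r1): the second write shadows the first,
    // so each definition of r1 becomes a distinct SSA value.
    let ssa = rename_block(&[(1, 2), (1, 1)]);
    assert_eq!(ssa, vec![(1, 0), (2, 1)]);
}
```

This direction (few fixed registers to many virtual ones) is cheap precisely because no allocation decisions need to be made; the expensive part is going back down to a concrete register file.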
That depends on the quality you want. But for ahead-of-time, maximum performance I agree. Then again, in that case, we don't need to compile often so it can take more time.
We have that implemented here. Still, it does not solve the mismatch in the number of registers between different native target architectures.
My original response was to the problem of supporting multiple architectures with varying numbers of available registers (x86, ARM, etc.). Excuse me, but I don't quite understand the purpose of creating SSA form at run-time other than for simplifying JIT optimizations. It seems at that point we would still need to do the register allocation for the specific hardware registers of the machine RBPF is running on. If this is the case, why add an extra step of converting the limited number of BPF registers to SSA form at run-time?
I assume the SSA form is needed to support SPIR-V, that's all. In order to justify doing register allocation at runtime (with the extra overhead), there would have to be sufficient register spilling in BPF code. I haven't observed this in the BPF code I've looked at.
It is already buried in the depths of this conversation:
The reason for register allocation is not spilling but simply utilizing all the registers ARM has, as we want to target that in the future as well. |
If the BPF code is not spilling, it has no need for extra registers. |
Do you have a representative body of BPF programs to analyze how much register pressure we currently have? Even then, future programs could change everything. |
I've been staring a lot at generated BPF code and I haven't seen a lot of spilling. eBPF was designed explicitly to avoid the runtime register-allocation problem; changing this should not be based on a hunch.
Keep in mind that ARM is RISC, it has no memory operands unlike x86-64.
That is certainly true. And I think we need a body of representative BPF programs to benchmark and analyze anyway if we want a design driven by our needs.
It just occurred to me that signed integer division would introduce another edge case:
Absolutely. Now the code emitter needs two compares and branches before emitting the division. There is possibly a case to be made that the BPF code can deal with signed division itself, as it has to do now.
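Concretely, the two guards are for a zero divisor and for the one overflowing case, `i64::MIN / -1`, which traps on x86-64 just like division by zero does. A sketch of what the emitted checks compute:

```rust
#[derive(Debug, PartialEq)]
enum DivError {
    DivideByZero,
    Overflow,
}

/// What a signed division with the two guard branches computes.
fn checked_sdiv(dividend: i64, divisor: i64) -> Result<i64, DivError> {
    // Guard 1: zero divisor (also needed for unsigned division).
    if divisor == 0 {
        return Err(DivError::DivideByZero);
    }
    // Guard 2: i64::MIN / -1 does not fit in i64;
    // on x86-64, `idiv` raises #DE for this case.
    if dividend == i64::MIN && divisor == -1 {
        return Err(DivError::Overflow);
    }
    Ok(dividend / divisor)
}

fn main() {
    assert_eq!(checked_sdiv(-7, 2), Ok(-3)); // truncation toward zero
    assert_eq!(checked_sdiv(1, 0), Err(DivError::DivideByZero));
    assert_eq!(checked_sdiv(i64::MIN, -1), Err(DivError::Overflow));
}
```

Whether the VM performs these checks or the program is required to emit them itself is exactly the trade-off being discussed here.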
We should try to get the linker to concatenate similar sections (such as all read-only sections) and place them consecutively in virtual address space, so that we don't need to allocate, copy and fill the gaps with zeros here: https://github.com/solana-labs/rbpf/blob/cd37c0bf61dcce22fa48b594005237eeb130af76/src/elf.rs#L430.
Very interesting discussion. Any update on that? Does anyone know the reason why the initial Solana design ended up being based on eBPF rather than WASM, given that there are a bunch of other, pre-existing smart contract chains that are WASM-based? Was it just the Solana founders' familiarity with eBPF?
I finished the task "Restrict the callx jump targets to only allow known symbols": solana-labs/rbpf#397. However, there are some new ideas for what to add to SBFv2:
In the last two weeks I spent some time refactoring, cleaning up and testing SBPFv2 in RBPF. During that I discovered multiple issues that we should consider addressing.

Proposed Solution
That way the
@Lichtso All of the above sounds great if we stick with ELF. I had a few thoughts as well:
This forces a sequential dependency on a memory load for every function call (to look up hash => destination). Since function calls are quite common, this may worsen performance and thrash caches. Instead, I would go the other direction:
Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?
Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess. Let's use this opportunity to specify strict validation rules for the number of entries of the

Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.
Good point, at least for interpreters I guess.
We already do distinguish between what you refer to as call and farcall with the

There is a 58-bit unified address space for all functions:
We could either encode the
Hex is fine too, base58 is just a bit nicer for debugging but probably not worth it.
I think the consensus with our team ATM is that we stick with ELF to keep the tooling reach, but restrict it down as much as possible for ease of parsing.
@Lichtso Mocking a flat address space is an interesting solution. I wasn't aware that code might need to indirectly call into another object, so that makes sense. That considered, I agree with having both call+callx opcodes handle "near" and "far" calls, and prefer this solution over a mix of hashes and relative addresses.
Using a hash as the lower bits of an address seems a bit exotic ... but not all that new in eBPF land, where low addresses map to kernel function calls. The remaining issue is the use of binary/opaque identifiers where ELF only supports strings. Ideally, we would find a solution that works the same for symbol hashes and program addresses. It seems we could either (1) specify a string encoding and use the ELF standard structures, or (2) roll our own custom dynamic imports table and call destination table. I begrudgingly prefer 1.
@alessandrod Pointed out to me, that we don't have to hack the symbol hashes into the symbol table or symbol strings directly. Instead we can use the Also, instead of using hash keys in the 32 LSBs of a function pointer, we could use the index into the symbol table. This has the tremendous advantage that without hashing, there also can't be any collisions, which would otherwise be a problem at runtime in the function pointers. An implementation could either lookup the target IP/PC in the symbol table of the selected dependency or allocate and initialize a PLT (procedure linking table) at the beginning of a transaction and do lazy loading (doing the lookup once and then patching the PLT so the next time the same entry is used it becomes a single indirect call). This approach would mandate that the order of exported symbols in a program stays stable over the course of recompilations / redeployments. During redeployment we have to compare the replacement of a program against its previous version anyway in order to make sure that the existing function signatures stay the same. Thus, this step could also be used to reorder the symbol table accordingly. |
Problem
Solana Bytecode Format is a variation of eBPF. Some of the architectural requirements of eBPF differ from what Solana needs. A new version 2 is needed to move away from BPF and meet the specific needs of Solana.
Proposed Solution
- Only registered functions are valid destinations for `callx`. (Refactor - Restrict `callx` to registered functions rbpf#397 and Refactor - Unify `call` and `callx` error handlers in JIT rbpf#439)
- Functions may not start in the middle of an `lddw` instruction. (Fix - Makes functions starting in the middle of `lddw` fail verification in SBFv2 rbpf#440)
- Only `call`/`callx` can reach other functions. (Verifier - Stricter control flow rbpf#454)
- Add an `--arch=sbfv2` option to `cargo-build-sbf`: `cargo build-sbf --arch=sbfv2`

From a documentation perspective SBFv2 will still be referred to as "SBF", and when calling tools like `cargo build-sbf` a default arch will be provided, v1 initially, then a switch over to v2.