Add proposal for transactions v2 and address map program #17103

jstarry · 2021-05-07T13:08:49Z

Problem

Since each account input in a transaction uses up 32 bytes, the 1232 size limit for each transaction can be used up very quickly when composing programs in complex transactions. For example, many defi protocols interact with the serum dex and then those defi protocols are composed together for things like flash loans, arbitrage, pools, etc.

Keeping the 1232 byte limit for transaction is highly desirable because it simplifies and speeds up the networking / gossip layer used by validators.

Tracking issue: #17102

aeyakovenko · 2021-05-07T15:58:26Z

tag @taozhu-chicago we are definitely going to need a cost model to deal with 10x more reads and writes

jstarry · 2021-05-07T16:09:10Z

yup the cost model and transaction wide compute limit changes are both prerequisites to this being implemented

tao-stones · 2021-05-07T16:19:20Z

the cost model is currently part of banking_stage (#16694), I am in the process of moving it to bank.rs, so be able to run validator on mainnet to collect cost data.

jstarry · 2021-05-08T05:50:59Z

@aeyakovenko did you have a strong preference for the account key compression approach vs on-chain transaction construction? I personally prefer building up transactions on-chain since it's more generic and future-proof. It also provides a defacto solution for devs to store intermediate transaction context on-chain rather than requiring devs to implement that into their own protocols via multi-step flows.

I imagine the user / dev experience to be something like:

User signs a big transaction which:
a. designates a derived tx buffer account
b. encodes a nonce value (to prevent replays)
c. includes payment for transaction creator service (someone needs to pay fees for loading the transaction)
Transaction creator service then:
a. receives the large transaction from a user
b. creates the derived buffer account with a stored recent blockhash (for deterministic fees)
c. pays for constructing the transaction on-chain
d. executes the transaction (nonce is incremented)
e. receives payment

aeyakovenko · 2021-05-11T04:39:37Z

@jstarry the onchain way is going to be slower and more complex and require more changes to the runtime. We need to load the account prior to locking the accounts it references for execution.

Compression is purely at the transport layer and will benefit all txs. Personally I am leaning this way.

dafyddd · 2021-05-12T02:58:55Z

@jstarry the onchain way is going to be slower and more complex and require more changes to the runtime. We need to load the account prior to locking the accounts it references for execution.

Compression is purely at the transport layer and will benefit all txs. Personally I am leaning this way.

Do you have an idea of how many extra pubkeys we can send in a tx if you implement compression?

For our application, we're going ahead with splitting up our on chain function execution across multiple transactions due to the large number of accounts we have to access.

jstarry · 2021-05-12T03:06:13Z

Current plan is to use a u32 index for accounts so you would be able to fit in about 8x more than the current limit of ~35 accounts or use the savings to pass more instruction data.

Keep in mind this change is predicated on an improved transaction cost model that can handle/prohibit expensive transactions that take advantage of the wider constraints

aeyakovenko · 2021-05-12T16:52:10Z

@jackcmay can we get this into 1.7

jackcmay · 2021-05-12T18:23:53Z

The challenge of getting this into v1.7 is getting both the cost-model and the transaction-wide caps in as prerequisites. Looks like the cost-model work could go in pretty soon given it's limited fix-cost based scope. The transaction-wide-cap stuff is still in the design phase and will require some coordination with partners (raydium at least) (unless we white list them up front).

ryoqun · 2021-05-13T08:09:45Z

docs/src/proposals/big-transactions.md

+Solana can provide a way to map commonly used addresses to small integer values.
+This mapping can be computed on epoch boundaries by all validators or registered
+on-chain in state managed by a new on-chain program.


I think this pre-registration thing would be a bit of awkward. Also think about some urgent fix needing another registration and waiting for the next epoch, assuming this is high-level defi, where effectively shortened account keys is kind of mandatory for functioning. Sounds a disaster?

Also, the integer values are precious assets by itself. So, we need auction or correct pricing. Otherwise, people (or hackers) are rushed to squat at them?

How about doing some git-like prefix disambiguate thing?

Make new tx serialization variant so that all accounts (except system account and signer account) can be referenced by variable length prefix for its pubkeys (system account can't be referenced to avoid squatting-attack at same prefix)

good: generally simpler design, no pre-registration, no transitions of tx formatting for epoch-boundaries

good: fortunately, our key space is well distributed by sha256 and ed25519.

bad: reduced size efficiency: maybe 64bits will be needed for practical reasons.

bad: given tx can fail due to ambiguous account prefix depending on account state at the time, so tx sim is kind of requisite.

bad: we're more reliant on BTreeMap, which is kind of hard to optimize.

@ryoqun that is a cool idea but I'm wondering how we prevent ambiguous accounts if anyone can create accounts at arbitrary addresses by transferring lamports. Are you thinking of additional constraints on the account disambiguation like must be owned by some program or be a PDA?

system account can't be referenced to avoid squatting-attack at same prefix

@jstarry that's exactly the reason I excluded any system accounts at all (those simply created by transferring arbitrary addresses). :) My observations is that defi doesn't need to reference to system accounts so often unless signature check (in that case, we can't use shorthand notation unless breaking our sigcheck pipelining.. xD)

Ah that makes perfect sense, thanks for the explanation. I really like the simplicity of this approach.

Also, 64 bit prefix sounds reasonably high.. that's the equivalent of grinding over 10 base58 characters right? 2^64 (1.8e19) ~ 58^11 (2.5e19). But at some point it may be economically worth it to DoS an on-chain protocol by grinding an ambiguous key using cheap compute in the future with some crazy efficient gpu.

Do you have any ideas around future-proofing this approach @ryoqun? The prefix length could be flexible which would allow protocols to adjust to longer prefixes. We could also have a first come first served model for ambiguous accounts to have a strict ordering.

@jstarry well I haven't come up with nice protection. that's another bad: point... xD

Hopefully, the dos risk should be really low risk. That's because clients just notice some ambiguity at tx sim from rpc after the attacker indeed mounted the squatting attack. Then, client will add another byte to the prefix (then it'll become 9 byte) and resubmit the tx, then done. On the other hand, this leads the bad: tx sim is required. attacker must spend 256x efforts. So, this is the reason I made prefix variable and mentioned git like :)

So, after spending sizable power/cost for grinding, attacker can only Dos for a slot or two of duration.

jstarry · 2021-05-13T11:26:56Z

I think we could do pre-registration without a limited index space by creating an account index program which allows anyone to create "index accounts" which hold some number of addresses. I imagine the new transaction type to look something like:

message.account_keys: (no change) list of addresses
message.index_keys: (new) list of integer pairs A.n where A is an account_keys index and n is the nth address stored in account_keys[A] account.
message.instruction.accounts: indexes into either account_keys or index_keys depending on if the index is greater than account_keys.len() - 1

Protocol authors can pre-create indexes at derived addresses for easy lookup. Transactions typically need the same group of account each time a particular market is used, so this grouping of addresses feels natural.

This would allow anyone to create their own index accounts if needed so there's no risk of denial of service on epoch boundaries. The index accounts would be time-locked so that they couldn't be recreated multiple times within an epoch too.

The runtime dynamically loads the extra accounts from index accounts just like the upgradeable loader dynamically loads the program data account

ryoqun · 2021-05-13T12:54:03Z

I think we could do pre-registration without a limited index space by creating an account index program which allows anyone to create "index accounts" which hold some number of addresses.

cool. this sounds workable. So, this is like page table entries from the memory paging. :)

aeyakovenko · 2021-05-14T11:46:24Z

Yea. I can’t decide if we should have user defined dictionaries or just index all the pubkeys every epoch.

aeyakovenko · 2021-05-15T16:39:31Z

With a registered dict, the keys could be u16 indexed. I don't think a single dictionary needs to exceed 64kb

jstarry · 2021-05-21T22:28:12Z

@aeyakovenko @jackcmay I fleshed out the proposal with the design of an on-chain account indexing program as well as new transaction format which can store up to 256 account inputs. I believe this approach will be the most flexible going forward. I'd really appreciate some feedback, thanks!

docs/src/proposals/big-transactions.md

jon-chuang · 2021-06-09T04:15:36Z

@aeyakovenko after some analysis, I agree.

The issue is not with RAM residency, but further pipelining the account loads, switching from MMap to application-managed disk I/O to reduce tail-end account load latency and offer better thread utilisation, and restructuring some of the accountsdb calls to be batched and run async under the hood.

Note that the tail-end latency of tx batch (which batch-locks accounts) is very very critical for linear state updates in blockchain, which the current method of sequentially triggering at most 2560-32768 disk I/O ops per batch relying on kernel-side page-faulting black box with zero application-side control (that then introduces challenges around scheduling helper I/O threads) fails miserably to guarantee.

I've tried to "naively" parallelise the loading #17774 but it's hard to schedule threads in a lightweight manner when you don't know if your thread is going to hit RAM or stall on a black-box blocking kernel-side exception into interrupt. Parallelising naively leads to far worse average-case performance and unpredictability due to the OS thread and work-stealing scheduler.

The other benefits of better leveraging NVME IOPs is reducing RAM usage while achieving better performance.

Which is why the persisted index, I believe, ought to avoid MMap at all cost - mmap needs very high RAM to disk ratio (or basically 1:1) to mitigate tail-end latencies. Hiding latencies requires either async or batching. I prefer batching to avoid propagating async to other parts of the code, which require conversion of & refs to Rc or Arc (although this is not super terrible if you're only passing a few "global" objects with long lifetimes).

I'm coming up with a report on this issue, giving some recommendations, the rationale, analysis, including a study of existing high performance DBs like ScyllaDB, and potentially the use of non-blocking kernel IO like io_uring.

The proposal involves replaceing MMap in AppendVec, and making a bigger, metadata-aware (priority, data-size, is_executable) application-managed cache that scales to available RAM and can relinquish RAM in times of memory pressure (by quering via sysinfo). The application leaves a buffer of 20% of RAM free, and evicts or grows cache capacity whenever get_available_memory/get_total_memory deviates from 20%.

Preferably, what we'd recommend to validators is to set swapoff. Because hard fault is never good for performance.

CC accountsdb folks: @sakridge @jeffwashington @carllin

docs/src/proposals/transactions-v2.md

jstarry · 2021-06-10T01:13:12Z

docs/src/proposals/transactions-v2.md

+// Uses custom deserialization. If the first bit is not set, deserialize as the
+// original `Message` format. Otherwise deserialize as a versioned message.
+pub enum Message {
+  Unversioned(UnversionedMessage),
+  Versioned(VersionedMessage),
+}


@garious please let me know your thoughts on this serialization scheme. Rather than be stingy on bytes, I propose keeping custom deserialization to a minimum and removing the need for custom serialization.

Since bincode doesn't support custom enum discriminants, I think that we can sacrifice the first byte for versioned messages (represented below by the prefix field) and use off-the-shelf ser/de for VersionedMessage below. This way, both UnversionedMessage (the currently used Message format) and VersionedMessage can both be serialized directly to raw bytes without any custom handling (custom handling would be required if we tried to encode a version in the top 4 bits of the first message byte, for example).

garious

Looks good to me. Looking forward to seeing how the caching semantics change as this is prototyped. It'd be nice if that activation data can somehow be tucked under the hood.

jstarry · 2021-06-11T00:37:56Z

Thanks for the discussion everyone, it's time for implementation to begin. Feel free to add more comments here or create issues for any additional concerns.

jackcmay · 2021-06-11T18:29:46Z

docs/src/proposals/transactions-v2.md

+Using an address map in a transaction should incur an extra cost due to
+the extra work validators need to do to load and cache them.
+
+#### Metadata changes


The internal format of a transaction does not have to match the OTA format. Incoming transactions could be converted to a format that is consolidated (no extra map account data needed) and more naturally handled and stored by the runtime. One option to minimize runtime changes could be to convert the v2 tx + maps to a v1 tx. The runtime would probably need to change very little if this conversion could be done early enough. Or maybe create a new internal evolving trait based tx format that v1, v2, vx gets converted into.

Very true. With multiple formats there's the question about which bytes are actually going to be signed and verified from the ledger:

If the bytes of the OTA format are signed to produce the tx signature, then we need to make sure that validators also receive that OTA format in the ledger blocks so that they can verify the integrity of transactions. The consolidated transaction info isn't broadcasted, it's always derived.

If the bytes of the consolidated format (all maps resolved) are signed, then the block producers have to do an extra step of transaction expansion before doing signature verification. But then the consolidated transaction bytes can be stored in the ledger directly and validated more easily.

The current proposal is to do 1) and then both block producers and validators need to expand the OTA format themselves to get the consolidated format which the runtime can handle. By storing the extra metadata, RPC nodes will also know how to expand those transactions before sending to clients. We already do this for logs, inner instructions, and balances.

If we go with 2) instead, we don't need any of that extra metadata and validators don't need to expand transactions but this would also introduce an extra step before signature verification for expanding incoming transactions. This might be ok since all address maps would be cached. It would also allow us to reference signer addresses with address maps since they would be expanded before signature verification.

@sakridge does this framing make sense and do you prefer one approach over the other?

jackcmay · 2021-06-11T18:52:54Z

One of the short comings of the original tx format was that it wasn't able to explicitly describe CPI dependencies. The result is the need to pass ALL the accounts and programs needed by an instruction execution chain to the top level program. Doing so creates a lot more overhead since all the accounts needed below have to serialized/mapped into the program space of each caller. Most accounts probably don't have to be passed in in their entirety or at least their data does not. If the tx included a dependent account field then the runtime could keep those accounts in the wings and reference them when needed without requiring the programs to ferry them all along. We should consider this now that we are bumping the tx format.

ryoqun · 2021-06-12T13:39:30Z

One of the short comings of the original tx format was that it wasn't able to explicitly describe CPI dependencies. The result is the need to pass ALL the accounts and programs needed by an instruction execution chain to the top level program. Doing so creates a lot more overhead since all the accounts needed below have to serialized/mapped into the program space of each caller. Most accounts probably don't have to be passed in in their entirety or at least their data does not. If the tx included a dependent account field then the runtime could keep those accounts in the wings and reference them when needed without requiring the programs to ferry them all along. We should consider this now that we are bumping the tx format.

@jackcmay oh, it seems that this problem is can be solved by #17796 as another selling point

the proposal introduces new syscall called access_as_unsigned(pubkey, flags), so that any child (CPI-ed) instruction processors can load accounts on demand like sysvars() as long as referenced by tx by v1 or v2 way. This means we don't need to pass all the accounts to the entrypoint and we can avoid the overhead :)

jackcmay · 2021-06-14T06:45:16Z

One of the short comings of the original tx format was that it wasn't able to explicitly describe CPI dependencies. The result is the need to pass ALL the accounts and programs needed by an instruction execution chain to the top level program. Doing so creates a lot more overhead since all the accounts needed below have to serialized/mapped into the program space of each caller. Most accounts probably don't have to be passed in in their entirety or at least their data does not. If the tx included a dependent account field then the runtime could keep those accounts in the wings and reference them when needed without requiring the programs to ferry them all along. We should consider this now that we are bumping the tx format.

@jackcmay oh, it seems that this problem is can be solved by #17796 as another selling point

the proposal introduces new syscall called access_as_unsigned(pubkey, flags), so that any child (CPI-ed) instruction processors can load accounts on demand like sysvars() as long as referenced by tx by v1 or v2 way. This means we don't need to pass all the accounts to the entrypoint and we can avoid the overhead :)

I think #17796 partially addresses the point I'm bringing up. The dynamic discovery of dependent accounts during simulation presented by #17796 and the explicit specification of dependent accounts in the tx together are required. In simulation the dependent account list will be populated, in non-simulation, it will be read from the tx.

jon-chuang · 2021-07-01T19:03:29Z

Hello @jstarry , minor nit, but I've just read that the way bpf_loader serializes KeyedAccount is by writing the position as

if is_dup {
    v.write_u8(position as u8)
          .map_err(|_| InstructionError::InvalidArgument)?;
} else {
    v.write_u8(std::u8::MAX)

So maybe 255 accounts is the max, not 256, as u8::MAX is reserved for non-duplicate accts

jstarry · 2021-07-01T19:08:04Z

So maybe 255 accounts is the max, not 256, as u8::MAX is reserved for non-duplicate accts

Thanks @jon-chuang!

jon-chuang · 2021-07-01T19:47:15Z

Actually, reading the logic more carefully, it actually doesn't matter - if you have max 256 accounts, then is_dup will never return a position = u8::MAX as usize. So just ignore this.

* Add proposal for supporting big transactions * account index program * fix formatting * review feedback * Add cost changes section * Add cost section and more attack details * fix lint * document metadata changes * nit * rpc details * add index meta struct * add additional proposal and chagne title * rename proposal file * rename to address map and rewrite tx format * no more appends, limit mapping size to 256 * update dos section * add note about readonly * restructure message to use enum * cleanup (cherry picked from commit 1915191)

…8376) * Add proposal for supporting big transactions * account index program * fix formatting * review feedback * Add cost changes section * Add cost section and more attack details * fix lint * document metadata changes * nit * rpc details * add index meta struct * add additional proposal and chagne title * rename proposal file * rename to address map and rewrite tx format * no more appends, limit mapping size to 256 * update dos section * add note about readonly * restructure message to use enum * cleanup (cherry picked from commit 1915191) Co-authored-by: Justin Starry <justin@solana.com>

jstarry requested review from sakridge, aeyakovenko, mvines and jackcmay May 7, 2021 13:08

ryoqun reviewed May 13, 2021

View reviewed changes

jstarry added 2 commits May 21, 2021 06:44

Add proposal for supporting big transactions

ae45cd4

account index program

9a5dabf

jstarry force-pushed the proposal/big-tx branch from 2c7bbfd to 9a5dabf Compare May 21, 2021 22:25

jstarry marked this pull request as ready for review May 21, 2021 22:25

fix formatting

006608e

t-nelson reviewed May 22, 2021

View reviewed changes