
Low IOPS during phase 5 due to synchronous reads #1516

Closed
aasseman opened this issue Feb 25, 2021 · 20 comments

@aasseman

Hi!

TG, during the execution phase, seems to do all of its I/O synchronously (i.e., one by one), which prevents the OS and the disk from optimizing seek times and reducing overall random I/O latency.

The issue is particularly visible when using a cloud instance, where disks are generally network block storage, which increases latency a lot even when requesting SSDs (high random IOPS, but still high latency because of the network layer).
See this documentation for example: https://cloud.google.com/compute/docs/disks/performance. Note that the point still applies on bare metal, where latency still exists.

The solution (if the logic permits) would be to have multiple lightweight threads requesting (prefetching) the different pieces of data simultaneously, so that the OS I/O queue could optimize the disk accesses.

I am running the execution step right now locally on an HDD, and I am getting a throughput of ~100kB/s, read IOPS ~27, and an IO queue depth of ~1.9. Previous steps were able to fill the IO queue to ~128. (Measured through iostat -xt 10 on a drive used exclusively for TG's data)
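
To make the suggestion concrete, here is a minimal Go sketch of that kind of prefetching (the file handle, offsets, page size and worker count are placeholders, not TG's actual structures): a small pool of goroutines issues reads concurrently, so the kernel sees a deep I/O queue instead of one request at a time.

import (
    "os"
    "sync"
)

// prefetch reads many offsets concurrently so the OS I/O scheduler can
// reorder and merge the requests. The data itself is discarded; the goal
// is only to pull the pages into the page cache before they are needed.
func prefetch(f *os.File, offsets []int64, pageSize, workers int) {
    jobs := make(chan int64)
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            buf := make([]byte, pageSize)
            for off := range jobs {
                _, _ = f.ReadAt(buf, off) // errors ignored: this is only a cache warmup
            }
        }()
    }
    for _, off := range offsets {
        jobs <- off
    }
    close(jobs)
    wg.Wait()
}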

@AskAlexSharov
Collaborator

AskAlexSharov commented Feb 25, 2021

Hi. Here are a few more things for you to consider:

  • TG relies heavily on the OS page cache - it means most reads actually hit the page cache (you can see this, for example, with sudo cachestat-bpfcc from these tools: https://github.com/iovisor/bcc)
  • It means the OS needs to go to disk only on a page-cache miss (the whole state is 32GB - and its hot part probably fits in memory)
  • It's true that some data can be prefetched asynchronously - for example "reading blocks" https://github.com/ledgerwatch/turbo-geth/blob/master/eth/stagedsync/stage_execute.go#L188 - blocks are never in the page cache because each block is read/used only once during "phase 5" (block execution). A rough sketch of such read-ahead follows this list.
  • What really could solve the problem is parallel block execution - right now TG and Geth can only execute transactions one by one. But that is very far from going live.
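
As a rough illustration of that "reading blocks ahead" idea only (readBlock and executeBlock are hypothetical stand-ins, not the actual stagedsync functions): a producer goroutine keeps a buffered channel of upcoming blocks filled while the single execution loop consumes them in order, so reads overlap with execution without changing execution order.

// Hypothetical read-ahead pipeline: block reads happen in the background
// while execution itself stays strictly sequential.
func execWithReadAhead(
    from, to uint64, ahead int,
    readBlock func(num uint64) ([]byte, error), // stand-in for the DB read
    executeBlock func(block []byte) error,      // stand-in for the EVM execution
) error {
    type result struct {
        block []byte
        err   error
    }
    ch := make(chan result, ahead) // buffer size = how many blocks to read ahead
    go func() {
        defer close(ch)
        for n := from; n <= to; n++ {
            b, err := readBlock(n)
            ch <- result{b, err}
            if err != nil {
                return
            }
        }
    }()
    for r := range ch {
        if r.err != nil {
            return r.err
        }
        // A real implementation would also cancel the producer on early return.
        if err := executeBlock(r.block); err != nil {
            return err
        }
    }
    return nil
}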

And here are some general architecture decisions made by TurboGeth:

  • Almost no concurrency - because: 1. real disks (even NVMe) are much faster without concurrency; 2. it simplifies code and profiling (one thing happens at a time and you clearly see the bottleneck); 3. different cloud providers do different magic - it's hard to fit them all (and real disks) at the same time.
  • Good news - if you serve lots of RPC requests, they will happen in parallel without any locks.
  • So, generally, in TG only 1 write transaction can happen at a time (and this TX will stay within 1 thread), while an unlimited number of read transactions can run at the same time.

What is your block execution speed now? (What do you see in the logs?)

@JekaMas
Contributor

JekaMas commented Feb 25, 2021

> Hi. Here are a few more things for you to consider:

I think the whole answer should go into a Q&A section of the README. The questions are really good and common.

@aasseman
Author

aasseman commented Feb 25, 2021

I get 0.1 blocks per second (syncing blocks 11M+) on a dedicated local 7200rpm SAS HDD (not a cloud machine). I got about the same in the cloud.

> 1. real disks (even NVMe) are much faster without concurrency

My experience is only with Linux, so I'm not sure how intelligently other OSes manage this.

It really depends on which scheduler has been chosen for the device. It turns out that Linux automatically chooses blk-mq [source] for NVMe devices and can take advantage of parallelism at the hardware level if the device allows it [source]. So it's highly probable that there is more performance to be squeezed even out of NVMe SSDs. In the worst case, there will be no performance penalty.

When using an HDD, the default scheduler will take advantage of the elevator algorithm, which will considerably increase throughput.

In any case, I am convinced that looking into async I/O can only be beneficial here. But the best would be to get a real kernel I/O expert to give their 2 cents...
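
One cheap way to experiment with this, without restructuring the execution loop, would be to hand the kernel explicit read-ahead hints for ranges we know we will need soon (a sketch assuming Linux and the golang.org/x/sys/unix package; the ranges argument is a placeholder):

import (
    "os"

    "golang.org/x/sys/unix"
)

// adviseWillNeed tells the kernel we intend to read these byte ranges
// soon, so the block layer can queue and reorder the resulting disk
// reads (elevator scheduling on HDDs, hardware queues on NVMe).
func adviseWillNeed(f *os.File, ranges [][2]int64) error {
    for _, r := range ranges {
        if err := unix.Fadvise(int(f.Fd()), r[0], r[1], unix.FADV_WILLNEED); err != nil {
            return err
        }
    }
    return nil
}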

@AskAlexSharov
Collaborator

Interesting. Thank you.

  • The problem is: there isn't much to parallelize, because not much is happening. The algorithm is "read the block, execute it, read the next block, execute it, when enough changes have been collected do a write/commit, read the next block, ...". Without a feature like "parallel block execution" there isn't much to read in parallel (writes already happen in rare big batches - they are done in RAM and then an fsync/msync syscall is issued on the mmap'ed file).
  • Some "warmup" of data when the node starts can be highly parallelized - but our experiments with warmup didn't show enough speedup on SSD/NVMe/HDD (non-cloud).
  • "Parallel block execution" is a very fundamental problem - you need to somehow analyze contracts and prove that one contract will not read data written by another. Somebody worked on it, but I don't know the status (maybe others on our team know).

@AskAlexSharov
Collaborator

Another option: make smart-contract execution more "sequential-reads friendly", instead of the random get/set mindset.
Another option: increase the page size of our database - it's 4KB now. If it were 8KB (or more), then one DB page read would produce multiple OS page misses and the OS would probably group them. But we haven't experimented with increasing the page size (it certainly has pros and cons), because it's not a priority now.
Another option: switch to Silkworm (our C++ implementation of "phase 5") - it's generally 3 times faster and has better manual caching. Maybe it will help in your scenario.

@AlexeyAkhunov
Contributor

Yes, the way Ethereum is currently designed makes it hard to parallelise: there is a large global state, and all accesses to it from transactions require the "Serialisable" isolation level. There are things we can do that will work well in many cases, but as @AskAlexSharov said, it is not going to be ready soon.

@aasseman
Author

What initially convinced me to post here is that I noticed that even though the execution is progressing at ~0.1 blk/s, iostat shows me ~27 reads/s. So that would mean ~270 individual reads for the execution of a single block.

Isn't it possible to issue those ~270 reads asynchronously, instead of looking at parallelizing block execution?

Sorry if my suggestions are naive; I am just trying to interpret what's happening on my hardware.

Here's a snapshot of what I was observing:
tg (UTC time):

INFO [02-25|03:24:43.595] [5/14 Execution] Executed blocks         number=11799290 blk/second=0.115 batch=11.10MiB  alloc=311.74MiB sys=2.65GiB   numGC=313
INFO [02-25|03:25:17.641] [5/14 Execution] Executed blocks         number=11799294 blk/second=0.118 batch=11.17MiB  alloc=206.46MiB sys=2.65GiB   numGC=314
INFO [02-25|03:25:44.221] [5/14 Execution] Executed blocks         number=11799297 blk/second=0.115 batch=11.22MiB  alloc=236.53MiB sys=2.65GiB   numGC=314
INFO [02-25|03:26:17.922] [5/14 Execution] Executed blocks         number=11799301 blk/second=0.121 batch=11.27MiB  alloc=276.25MiB sys=2.65GiB   numGC=314
INFO [02-25|03:26:49.407] [5/14 Execution] Executed blocks         number=11799304 blk/second=0.097 batch=11.32MiB  alloc=310.39MiB sys=2.65GiB   numGC=314

iostat -xt 10 (Pacific time):

02/24/2021 07:25:08 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.07    0.01    5.73   10.71    0.00   80.48

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
sde             26.80    107.20     0.00   0.00   36.74     4.00  101.90    307.20     0.20   0.20   10.31     3.01    0.00      0.00     0.00   0.00    0.00     0.00    1.88  99.48


02/24/2021 07:25:18 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.12    0.00    4.09   10.48    0.00   82.31

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
sde             27.20    108.80     0.00   0.00   36.21     4.00  101.30    305.20     0.20   0.20   10.53     3.01    0.00      0.00     0.00   0.00    0.00     0.00    1.90  99.68


02/24/2021 07:25:28 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.44    0.01   10.56   13.39    0.00   71.61

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
sde             27.10    108.40     0.00   0.00   36.46     4.00   99.60    300.00     0.20   0.20   10.38     3.01    0.00      0.00     0.00   0.00    0.00     0.00    1.88 100.00

@AskAlexSharov
Collaborator

@asasmoyo, hi. I've got an idea! Can you enable some filesystem compression feature? The logic: our DB reads 4KB blocks, but the ZFS default compression block is 128KB. Reads would stay synchronous, but they would happen in "bigger batches".

"Isn't it possible to issue those ~270 reads asynchronously, instead of looking at parallelizing block execution?" - sorry, but these are synonyms. The next read can depend on information the smart contract got from the previous read. It would need either support from Solidity or some kind of "EVM branch prediction" (which led to security issues in CPUs last year). Both are hardly doable.

@tsutsu
Contributor

tsutsu commented Apr 1, 2021

Just as another data-point — and maybe food-for-thought for @aasseman: I'm using TG (with the mdbx storage engine) on a Google Cloud n2d-standard-8 instance, where the TG data-dir is backed by four GCE "local scratch (NVMe) SSDs" (fixed-size 350GiB devices), RAID0ed together using mdraid, then mounted writeback. I'm getting much faster execution:

INFO [04-01|14:34:05.841] [5/14 Execution] Executed blocks         number=11696341 blk/second=7.655     batch=72.20MiB  alloc=313.75MiB  sys=5.76GiB numGC=65915
INFO [04-01|14:34:35.878] [5/14 Execution] Executed blocks         number=11696563 blk/second=7.400     batch=74.99MiB  alloc=289.93MiB  sys=5.76GiB numGC=65920
INFO [04-01|14:35:05.861] [5/14 Execution] Executed blocks         number=11696789 blk/second=7.793     batch=77.74MiB  alloc=235.11MiB  sys=5.76GiB numGC=65925
INFO [04-01|14:35:35.882] [5/14 Execution] Executed blocks         number=11697012 blk/second=7.433     batch=80.39MiB  alloc=403.66MiB  sys=5.76GiB numGC=65929
INFO [04-01|14:36:05.924] [5/14 Execution] Executed blocks         number=11697231 blk/second=7.300     batch=83.20MiB  alloc=283.97MiB  sys=5.76GiB numGC=65934
INFO [04-01|14:36:35.833] [5/14 Execution] Executed blocks         number=11697466 blk/second=8.103     batch=86.08MiB  alloc=413.59MiB  sys=5.76GiB numGC=65938
INFO [04-01|14:37:05.888] [5/14 Execution] Executed blocks         number=11697703 blk/second=7.900     batch=88.89MiB  alloc=289.72MiB  sys=5.76GiB numGC=65943
INFO [04-01|14:37:35.866] [5/14 Execution] Executed blocks         number=11697951 blk/second=8.552     batch=91.57MiB  alloc=472.86MiB  sys=5.76GiB numGC=65947
INFO [04-01|14:38:05.828] [5/14 Execution] Executed blocks         number=11698196 blk/second=8.448     batch=94.22MiB  alloc=333.72MiB  sys=5.76GiB numGC=65952
INFO [04-01|14:38:35.814] [5/14 Execution] Executed blocks         number=11698437 blk/second=8.310     batch=96.76MiB  alloc=367.00MiB  sys=5.76GiB numGC=65956

My run of iostat -xt 10 emits (with other devices trimmed):

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
nvme0n1       1710.40 203948.40   122.30   6.67    0.73   119.24   96.60    388.40     0.00   0.00    0.15     4.02    0.00      0.00     0.00   0.00    0.00     0.00    0.01  81.32
nvme0n2       1684.20 205067.60   121.30   6.72    0.78   121.76   69.80    279.60     0.00   0.00    0.15     4.01    0.00      0.00     0.00   0.00    0.00     0.00    0.01  81.60
nvme0n3       1661.30 205021.20   123.40   6.91    0.82   123.41   55.70    224.80     0.00   0.00    0.16     4.04    0.00      0.00     0.00   0.00    0.00     0.00    0.01  81.60
nvme0n4       1737.50 205470.40   117.10   6.31    0.83   118.26   46.40    186.00     0.10   0.22    0.15     4.01    0.00      0.00     0.00   0.00    0.00     0.00    0.01  81.88

How I originally formatted the device, following GCE guidance on use of local SSDs:

sudo mkfs.ext4 -F -E stripe-width=256,lazy_itable_init=0,lazy_journal_init=0,discard /dev/md0

How I mounted the device:

# in /etc/fstab
/dev/md0 /scratch ext4 discard,nobarrier,noatime,data=writeback,defaults,nofail 0 2

@tsutsu
Contributor

tsutsu commented Apr 1, 2021

> The next read can depend on information the smart contract got from the previous read.

There is a way around this that we're discussing implementing internally at our company. Our target architecture is very different from TG's (we're aiming for a node that executes synchronously once, and then N stateless trace nodes that re-execute in parallel without needing the whole blockchain to be available locally), but I think the technique can be adapted.

The idea is that any node executing a block "blind" can also be configured to capture, during execution, a listing of all the origin-storage that was read during the block's execution: accounts, storage slots, and code. Such nodes can put this "read-hint data" into some deterministic encoding and make the "hint" available alongside the block in some way.

This hint doesn't help the node that executes the block "blind", of course—it's going to be just as slow as before, if not slower (because now it's persisting hits to the StateDB's origin-storage to an in-memory log.) But for any block that enters consensus, the vast majority of its executions will be re-executions, done with the block already deep in consensus history (if not deep in the individual node's chain), where it's fully possible for the node that's executing the block to grab some hint objects that other nodes have computed about that block after they executed it, before it actually executes the block itself. So the read-hint object can help all the other nodes in the network. (And not every node needs to generate such read-hints. As long as some nodes generate them, other nodes can just mirror them.)

Our own strategy for achieving stateless tracing, is to export both the origin-storage keys and values — i.e. accounts in {contractAddress, accountRLP} form; storage slots in {contractAddress, incarnationSeq, keyU256, dataU256} form; and code in {contractAddress, deployedBytecode} form. We wrap this together with the block+txs+receipts to form a "block specimen", which, as it turns out, is all a node needs[1] to "simulate" a blockchain that only has that one block in it for the purposes of re-execution — similar to how EVMLab simulates a blockchain containing a single tx. We get the node to throw that block-specimen out over a websocket during newHead, where it can be captured and pushed to an object store; and then some dataflow engine can consume those objects entirely in parallel.

[1] Okay, you also need to capture 256 previous block-hashes to make the BLOCKHASH opcode happy. (This, among other things (e.g. uint256 column-store deduplication) makes it more sensible for our "block specimen" use-case to export ranges of blocks at once to a single well-packed block-range specimen.)
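
Written out as Go types just to make the shapes above concrete (the names and field layouts are illustrative only, not an existing API):

// Illustrative shapes for the "block specimen" described above.
type AccountRead struct {
    ContractAddress [20]byte
    AccountRLP      []byte
}

type StorageRead struct {
    ContractAddress [20]byte
    IncarnationSeq  uint64
    Key             [32]byte
    Data            [32]byte
}

type CodeRead struct {
    ContractAddress  [20]byte
    DeployedBytecode []byte
}

type BlockSpecimen struct {
    BlockRLP        []byte // block header + txs + receipts, encoded
    Accounts        []AccountRead
    Storage         []StorageRead
    Code            []CodeRead
    PrevBlockHashes [][32]byte // the 256 hashes needed by BLOCKHASH
}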

Anyway, in TG's case, you don't need to capture the values. You just need to capture the keys that are read. This would form an "execution-time state-read hint" object—a list of keys that can all be prefetched into memory at the start of the block's execution. TG can make this hint object available on the P2P protocol (introducing an extension to the eth protocol, some kind of "fetch read-hint for block of hash H" message); and other TG nodes can fetch these read-hints during phased sync, before the execution phase, to accelerate the execution phase. (Presumably, the node could allow the user to configure it to throw the execution read-hints away after it uses them; but if the user doesn't do that, then the node would become a mirror for these read-hints.)

Since this read-hint data doesn't specify the values to be read, there's no real trust required in the generator of the read-hints; the worst they can do is waste other nodes' time by telling them to prefetch stuff that isn't actually part of the block's execution. And nodes could also verify the read-hint after the fact (i.e. after the block is done executing), just by generating a read-hint themselves during execution and comparing the read-hint they generated to the one they relied upon at the beginning. Inaccurate read-hints would be discarded and replaced with the node's own computed one. In this way, accurate read-hints would propagate through the network.
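
In Go terms, the capture side is just a thin wrapper around whatever interface the executor uses to read state (StateReader here is a made-up stand-in, not TG's actual interface):

// Hypothetical reader interface standing in for the executor's state access.
type StateReader interface {
    ReadAccount(addr [20]byte) ([]byte, error)
    ReadStorage(addr [20]byte, key [32]byte) ([32]byte, error)
}

type slotKey struct {
    Addr [20]byte
    Key  [32]byte
}

// recordingReader records every key touched during execution, producing
// the block's "read-hint" as a side effect.
type recordingReader struct {
    inner    StateReader
    accounts [][20]byte
    slots    []slotKey
}

func (r *recordingReader) ReadAccount(addr [20]byte) ([]byte, error) {
    r.accounts = append(r.accounts, addr)
    return r.inner.ReadAccount(addr)
}

func (r *recordingReader) ReadStorage(addr [20]byte, key [32]byte) ([32]byte, error) {
    r.slots = append(r.slots, slotKey{addr, key})
    return r.inner.ReadStorage(addr, key)
}

Verifying a hint after execution is then just comparing the recorded key sets against the hint that was relied upon.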

@tsutsu
Contributor

tsutsu commented Apr 1, 2021

Note that EIP-2930 transactions also include data that can be used as a "read-hint" in this sense (this being much of the motivation behind the EIP). If the chain accepts, and then transitions to, EIP-2930-style transactions, then external read-hint objects would mostly no longer be needed past that point: the "read-hint" for a block could be computed just-in-time by combining the EIP-2930 tx read-lists with a few pieces of block metadata, e.g. the miner/validator account address.

External read-hint objects would still provide a benefit for historical txs below the transition point, however; and would also act as a fallback for non-EIP-2930 txs, if transition to them is less-than-total.

The interesting thing is that the same internal machinery within the StateDB would be necessary to drive EIP-2930 read-list generation, as would be needed to drive the generation of external read-hints. So if TG expects to support EIP-2930, it would need to include this logic anyway. Which would make an external read-hint feature already half-complete. 🙂
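
For example, deriving a per-block read-hint from EIP-2930 access lists plus block metadata could look roughly like this (the types are simplified stand-ins, not go-ethereum's or TG's):

// Simplified stand-in for an EIP-2930 access-list entry.
type AccessTuple struct {
    Address     [20]byte
    StorageKeys [][32]byte
}

// ReadHint is the set of keys to prefetch before executing the block.
type ReadHint struct {
    Accounts [][20]byte
    Slots    map[[20]byte][][32]byte
}

// hintFromAccessLists folds all tx access lists in a block, plus the
// miner/validator account, into one prefetchable read-hint.
func hintFromAccessLists(coinbase [20]byte, txAccessLists [][]AccessTuple) ReadHint {
    h := ReadHint{Slots: make(map[[20]byte][][32]byte)}
    h.Accounts = append(h.Accounts, coinbase)
    for _, list := range txAccessLists {
        for _, t := range list {
            h.Accounts = append(h.Accounts, t.Address)
            h.Slots[t.Address] = append(h.Slots[t.Address], t.StorageKeys...)
        }
    }
    return h
}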

@aasseman
Author

aasseman commented Apr 5, 2021

Thanks all for your suggestions!

Sorry if I am asking a naive question:
Isn't the goal of node synchronization to download blocks of transactions from other nodes and double-check them (by re-executing them)? If so, wouldn't it be possible, once all the blocks have been retrieved, to check their validity independently (and in parallel)? I know that in that case you would have to check a block's validity against blocks that haven't been checked yet. However, under the assumption that there is a low probability that the blocks are invalid, there's still a high chance of improved synchronization speed; and if an invalid block is found, you would only have to recheck the blocks that you had already checked against it.

@AskAlexSharov
Collaborator

@aasseman here is an example: "Account A has 10 ETH, account B has 0 ETH. Block 1 transfers 10 ETH from account A to B; block 2 transfers 10 ETH from B to A."

Now imagine you want to validate blocks 1 and 2 in parallel and you start with block 2. The transaction in block 2 will fail because account B has 0 ETH. That is of course incorrect, because after block 1 account B has 10 ETH.

@aasseman
Author

aasseman commented Apr 6, 2021

@AskAlexSharov I see - what I missed is that blocks contain only tx information, but no state...
And the other thing is that transactions within a block must also be executed in a precise order, if I understood correctly.

I think that now I understand the constraints a little better. Thanks for taking the time to educate me!

So, the process is that you update the state of the network as transactions are replayed in sequential order since tx 0 of block 0.
How large is that state? I would suppose that if it is in the tens of GB, most of it would fit into RAM and not require waiting for reads (~270 synchronous reads going to disk per block in my experiment). Is there a way to force the entirety of the current state to be stored in RAM?

@AskAlexSharov
Collaborator

All you say is correct. The state is tens of GB. But you also need the blocks themselves: that's ~100GB. TG works this way: it automagically (but lazily) stores the hot part of the DB in RAM; if your machine has a lot of RAM then the entire state will be stored in RAM. See how Linux mmap and the page cache work - but we can't use the ReadAhead feature of mmap because the entire TG DB is ~1TB. FYI: on a 256GB machine with SSD, genesis sync takes 36 hours.

You can try the flag --batchSize=8000M, but it's only for state, not for block info.

I see only one low-hanging fruit here: warming up block info in the background (like reading 1K blocks ahead). But I think it will not be enough for your case. In short: we have always targeted real hardware; to support network drives well we need servers and a set of users who need it or are ready to pay for it. So far, you are the first to request it. Thank you.

Good news - you can sync on one machine and then copy the TG datadir to other hardware. Or you can try resizing the machine to 256GB RAM and, after genesis sync, resize it back down. JSON-RPC load is more parallel, and maintaining the chain head will probably work well enough.
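
For illustration, a crude way to "force" hot data into RAM is simply to stream the data file through the page cache once after the node starts (a sketch; the path stands for whatever data file lives in the datadir, and it only helps if the machine has enough free RAM to keep the pages cached):

import (
    "io"
    "os"
)

// warmFile reads a file sequentially once so the kernel pulls it into
// the page cache; subsequent random reads then hit RAM instead of disk.
func warmFile(path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    buf := make([]byte, 1<<20) // 1 MiB per read keeps the I/O sequential
    for {
        if _, err := f.Read(buf); err != nil {
            if err == io.EOF {
                return nil
            }
            return err
        }
    }
}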

@AskAlexSharov
Collaborator

Closing for now, because it's unlikely we will do anything to support network disks in the near future. But your PRs are welcome.

@qkum

qkum commented Dec 14, 2021

> All you say is correct. The state is tens of GB. But you also need the blocks themselves: that's ~100GB. TG works this way: it automagically (but lazily) stores the hot part of the DB in RAM; if your machine has a lot of RAM then the entire state will be stored in RAM. See how Linux mmap and the page cache work - but we can't use the ReadAhead feature of mmap because the entire TG DB is ~1TB. FYI: on a 256GB machine with SSD, genesis sync takes 36 hours.
>
> You can try the flag --batchSize=8000M, but it's only for state, not for block info.
>
> I see only one low-hanging fruit here: warming up block info in the background (like reading 1K blocks ahead). But I think it will not be enough for your case. In short: we have always targeted real hardware; to support network drives well we need servers and a set of users who need it or are ready to pay for it. So far, you are the first to request it. Thank you.
>
> Good news - you can sync on one machine and then copy the TG datadir to other hardware. Or you can try resizing the machine to 256GB RAM and, after genesis sync, resize it back down. JSON-RPC load is more parallel, and maintaining the chain head will probably work well enough.

Aha, so a big trick is simply to rent a "fully decked out" VM at, say, Google for 2 days and then downgrade it, or download the entire DB to your PC and stop the paid VM from there on.

That would be a fast & pretty cheap solution.

But it would take about a week to download that 1.3 TB DB :)

  • I will consider it, because 25 GB a day for genesis sync is a bit too slow for my taste.
  • I'm running Erigon Stable on Ubuntu 18.04, in a VirtualBox VM, running on my external USB 3 HDD.

@AskAlexSharov
Collaborator

«download that 1.3 TB» - Erigon's DB is 2x compressible by lz4.

We continue working on snapshot sync, which will speed up genesis sync, but there's no ETA yet.

@syalovit

> «download that 1.3 TB» - Erigon's DB is 2x compressible by lz4.
>
> We continue working on snapshot sync, which will speed up genesis sync, but there's no ETA yet.

I'm running into the same issue with Polygon; specifically, RAID0 and EBS seem to be performing equivalently. My hunch is that Polygon block sizes are generally so large that a majority of the block write latency is spent in the kernel splitting the massive blocks that Erigon writes into pieces aligned with the underlying pages and the SSD hardware. Can you comment on any tuning that can be done in Erigon to validate that?

@AskAlexSharov
Collaborator

"integration warmup --table=PlainState" can pull one table from disk into the page cache in a multi-threaded way. It may be good for a cold start.
