
[WIP] Upstream v3.0.0 #215

Open

wants to merge 5,134 commits into op-erigon
Conversation

@mininny (Member) commented Aug 13, 2024

No description provided.

taratorio and others added 30 commits August 6, 2024 16:29
part of erigontech#11032

Fixes "block too soon error" which was due to incorrectly calculated
block producer priorities.
```
DBUG[08-06|13:31:23.898] [sync] onNewBlockEvent: couldn't connect a header to the local chain tip, ignoring err="canonicalChainBuilder.Connect: invalid header error Block 10391894 was created too soon. Signer turn-ness number is 2\n"
```

Removes `SpansCache` in favour of `heimdallService.Producers`, which
returns the correct producer priorities.
Simplifies the PatriciaContext interface: returning a state Update instead of
filling the Cell is more general and allows other trie implementations to
follow the interface without converting their inner representation of
Cell/Node/etc.

Slightly reduces code complexity, as Cell now has a state part (the Update) and
intrinsic parts (lens and other unexported fields like apk/spk/downHashedKey).

Finally, this allows us, as a next step, to remove the batch of `Process*`
functions and keep just `Process(context.Context, updates *Updates,
logPrefix string) ([]byte, error)`. In that case only `Updates.mode`
will decide whether `Update`s need to be collected during execution. In
general, we don't really need to keep the Update close to the key, because it's
already in `SharedDomains` if it's just a regular exec.
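
For illustration, a minimal sketch of the unified entry point once the `Process*` batch is collapsed; everything other than the quoted signature is a stand-in, not Erigon's actual definitions:

```
package commitment

import "context"

// Updates is a stand-in for the real updates collection; per the description
// above, its mode decides whether Updates are collected during execution.
type Updates struct{ mode int }

// Trie is an illustrative sketch of the single remaining entry point that
// other trie implementations could satisfy without exposing their inner
// Cell/Node representation.
type Trie interface {
	Process(ctx context.Context, updates *Updates, logPrefix string) ([]byte, error)
}
```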
Part of erigontech#11149
The Teku VC is able to work properly with this PR.

Known issue: Teku's event source client periodically disconnects, but the VC
can still work well.

Events to be added:
- block_gossip
- chain_reorg
- light_client_finality_update
- light_client_optimistic_update
- payload_attributes
- Updated gopsutil version as it has improvements in getting processes
and memory info.
`murmur3.New*` methods return an interface, and a minimum of 3 methods need to
be called on it.
`16ns` -> `11ns`

Also, I benchmarked `github.com/segmentio/murmur3` vs
`github.com/twmb/murmur3` on a 60-byte hashed string. The 2nd is faster but
adds asm deps, so I'm sticking with the pure-Go dep (because asm deps are not
friendly for cross-compilation); maybe I'll try it later, after our new release
pipeline is ready. Bench results:
intel: `20ns` -> `14ns`
amd: `31ns` -> `26ns`
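
For context, a hedged micro-benchmark sketch of the one-shot-vs-interface difference described above (the import choice is illustrative; the forks mentioned expose the same `Sum64`/`New64` API):

```
package murmur_test

import (
	"testing"

	"github.com/segmentio/murmur3" // illustrative; the other forks mirror this API
)

var data = make([]byte, 60)

// One-shot call: no hash.Hash64 value to manage, a single function call.
func BenchmarkMurmurSum64(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = murmur3.Sum64(data)
	}
}

// Via the interface returned by New64: at least three calls per hash.
func BenchmarkMurmurNew64(b *testing.B) {
	h := murmur3.New64()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		h.Reset()
		h.Write(data)
		_ = h.Sum64()
	}
}
```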
Before this PR we called heimdall.Synchronize as part of
heimdall.CheckpointsFromBlock and heimdall.MilestonesFromBlock. The
previous implementation of Synchronize waited on all scrapers to be
synchronised.

This is inefficient because `heimdall.CheckpointsFromBlock` needs only
the `checkpoints` scraper to be synchronised. For the initial sync we
first only need to wait for the checkpoints to be downloaded, and then we
can start downloading blocks from devp2p. While we are doing that, we can
let the spans and milestones be scraped in the background. Note this is
based on the fact that fetching checkpoints has been optimised by doing
bulk fetching and finishes in seconds, while fetching Spans has not yet
been optimised and for bor-mainnet can take a long time.

Changes in the PR:
- splits Synchronize into 3 more fine grained SynchronizeCheckpoints,
SynchronizeMilestones and SynchronizeSpans calls which are invoked by
the Sync algorithm at the right time
- Optimises SynchronizeSpans to check if it already has the
corresponding span for the given block number before blocking
- Moves the synchronisation point for Spans and State Sync Events into
`Sync.commitExecution`, just before we call
ExecutionEngine.UpdateForkChoice, to make it clearer what data needs to
be synced before calling Execution
- Changes EventNotifier and Synchronize funcs to return err if ctx is
cancelled or other errors have happened
- Input consistency between the heimdallSynchronizer and
bridgeSynchronizer - use blockNum instead of *type.Header
- Interface tidy ups
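
A minimal sketch of the finer-grained synchronisation surface this describes; the interface methods are named in the PR, while the `service` internals below are hypothetical stand-ins:

```
package heimdall

import "context"

// Split-out synchronisation API, invoked by the Sync algorithm at the right time.
type Synchronizer interface {
	SynchronizeCheckpoints(ctx context.Context) error
	SynchronizeMilestones(ctx context.Context) error
	SynchronizeSpans(ctx context.Context, blockNum uint64) error
}

type service struct{ /* scrapers etc. */ }

// Hypothetical helpers standing in for the real span store/scraper wiring.
func (s *service) haveSpanFor(blockNum uint64) bool             { return false }
func (s *service) waitForSpanScraper(ctx context.Context) error { return ctx.Err() }

// SynchronizeSpans returns early when the span covering blockNum is already
// available locally, instead of blocking on the scraper; it errors if the
// ctx is cancelled.
func (s *service) SynchronizeSpans(ctx context.Context, blockNum uint64) error {
	if s.haveSpanFor(blockNum) {
		return nil
	}
	return s.waitForSpanScraper(ctx)
}
```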
Make Cell unexported
Remove ProcessTree/Keys/Update
Reviewed and refreshed all unit/bench/fuzz tests related to commitment
erigontech#11326
Change test scheduling and timeouts after the OtterSync introduction.
We can now execute tests more frequently due to the significant
reduction in test time.

Scheduled to run every night:
- tip-tracking
- snap-download
- sync-from-scratch for mainnet, minimal node

Scheduled to run on Sunday:
- sync-from-scratch for testnets, archive node
- Collecting CPU and memory usage info about all processes running on
the machine
- Running the loop 5 times with a 2-second delay to calculate averages
- Sorting by CPU usage
- Writing the result to a report file
Result:
![Screenshot 2024-08-07 at 18 40 08](https://github.com/user-attachments/assets/aac1264c-1eb9-4c8e-b6a6-7e248e37855a)
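
A rough sketch of such a collector using gopsutil (the PR's actual code isn't shown here; names and the report format are assumptions):

```
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/shirou/gopsutil/v3/process"
)

type procUsage struct {
	name string
	cpu  float64 // CPU percent averaged over the samples
	rss  uint64  // resident memory in bytes (last sample)
}

func main() {
	const samples = 5
	usage := map[int32]*procUsage{}

	// Loop 5 times with a 2-second delay, accumulating per-process usage.
	for i := 0; i < samples; i++ {
		procs, err := process.Processes()
		if err != nil {
			panic(err)
		}
		for _, p := range procs {
			u, ok := usage[p.Pid]
			if !ok {
				name, _ := p.Name()
				u = &procUsage{name: name}
				usage[p.Pid] = u
			}
			if cpu, err := p.CPUPercent(); err == nil {
				u.cpu += cpu / samples
			}
			if mem, err := p.MemoryInfo(); err == nil && mem != nil {
				u.rss = mem.RSS
			}
		}
		time.Sleep(2 * time.Second)
	}

	// Sort by CPU usage, descending, then print the report.
	all := make([]*procUsage, 0, len(usage))
	for _, u := range usage {
		all = append(all, u)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].cpu > all[j].cpu })
	for _, u := range all {
		fmt.Printf("%-30s cpu=%6.2f%% rss=%d\n", u.name, u.cpu, u.rss)
	}
}
```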
closes erigontech#11173

Adds tests for the Heimdall Service which cover:
- Milestone scraping
- Span scraping
- Checkpoint scraping
- `Producers` API - compares the results with results from the
`bor_getSnapshotProposerSequence` RPC API
- Added totals for CPU and memory usage to the processes table
- Added CPU usage by core

Example output:
![Screenshot 2024-08-08 at 12 46 17](https://github.com/user-attachments/assets/ec0897d0-81c8-4436-bb65-527363157e76)
Forgot to silence the logging in the heimdall service tests in a
previous PR. The logging level can be tweaked when debugging is
necessary.
Refactored table utils to have an option to generate a table and return it
as a string, which will be used for saving data to a file.
…ot and added clearIndexing command (erigontech#11539)

Main checks:
* No gaps in steps/blocks
* Check that all indexing is present
* Check that all idx, history, and domain files are present
closes erigontech#11177
- adds unwind logic to the new polygon sync stage which uses astrid
- seems like we've never done unwinds for bor heimdall, so the empty
funcs are removed
Refactored printing of CPU info:
- move CPU details to table
- move CPU usage next to details table
- refactor code
…ch#11549)

relates to:
erigontech#10734
erigontech#11387

Restart Erigon with the `SAVE_HEAP_PROFILE = true` env variable and wait
until we reach 45% or more alloc in stage_headers, when
"noProgressCounter >= 5" or "Rejected header marked as bad" occurs.
and also move `design` into `docs` in order to reduce the number of
top-level directories
Before this PR we had a transaction-wide cache (map); now I'm changing it to be
EVM-wide. The EVM is a thread-unsafe object, so it's ok to use a thread-unsafe
LRU. ExecV3 is already using 1 EVM per worker, which means we will share the
cache between blocks (not on chain-tip for now).

bench:
- on `mainnet`: it shows a 12% improvement on a large eth_getLogs call
(re-exec of a large historical range of blocks near block 6M) - on hot state.

About chain-tip:
- don't see much impact (even if the cache is made global) - because the
current mainnet/bor-mainnet bottleneck is "flushing" changes to db. But
`integration loop_exec --unwind=2` shows a 5% improvement.
- in a future PR we can share 1 LRU across many new blocks - currently a
new one is created every stage loop iteration.
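
For illustration only, a sketch of attaching a non-thread-safe LRU to an EVM-scoped struct (using `hashicorp/golang-lru/v2/simplelru`, which is not goroutine-safe; the cached value type and key are assumptions, not the PR's actual cache):

```
package vm

import (
	"github.com/hashicorp/golang-lru/v2/simplelru"
)

// codeAnalysis is a stand-in for whatever per-code artifact the cache holds.
type codeAnalysis struct{ jumpdests []uint64 }

type EVM struct {
	// analysisCache lives for the lifetime of the EVM object. The EVM is
	// single-threaded, so a non-locking LRU is safe here; with ExecV3's
	// one-EVM-per-worker model the cache is reused across blocks.
	analysisCache *simplelru.LRU[[32]byte, *codeAnalysis]
}

func NewEVM() *EVM {
	lru, _ := simplelru.NewLRU[[32]byte, *codeAnalysis](4096, nil)
	return &EVM{analysisCache: lru}
}

func (evm *EVM) analysis(codeHash [32]byte, code []byte) *codeAnalysis {
	if a, ok := evm.analysisCache.Get(codeHash); ok {
		return a // cache hit: skip re-analysis
	}
	a := analyze(code)
	evm.analysisCache.Add(codeHash, a)
	return a
}

// analyze is a hypothetical helper standing in for the real analysis step.
func analyze(code []byte) *codeAnalysis { return &codeAnalysis{} }
```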
shohamc1 and others added 30 commits September 3, 2024 12:23
…se workflow (erigontech#11848)

New workflow: ci-cd-main-branch-docker-images.yml.
New Dockerfile targets for the new workflow in Dockerfile.release.
Changes in the release workflow: renamed an arg.

See issue erigontech#10251 for more
info.
Add tip-tracking test for bor-mainnet using a dedicated self-hosted
runner
It is necessary when using temporal KV remotely.

Additional changes:
- remove what I think is an oversight in `IndexRange`, where
`req.PageSize` was checked and cut to `PageSizeLimit`, but then not used
(`PageSizeLimit` itself was used instead)
- remove useless `limit--` in `HistoryRange`
and rename to `Reader/Writer`
remove interfaces related to it - to improve inlining
Fix on-trigger (correct branch)
Grammar fixes
…ontech#11813) (erigontech#11866)

**Existing behaviour:**
- Add up the possible value that the user must pay beforehand to buy gas
- Deduct that amount from the sender's account in `intraBlockState`,
but:
- Don't deduct the gas value amount if the user doesn't have enough, and
`gasBailout` is set

**New behaviour:**
- Don't check if sender's balance is enough to pay gas value amount, nor
deduct it, if `gasBailout` is set

**More rationale**
This would mean the sender's account would show `"balance": "="` in the
`trace_call` RPC method, that is, no change, if gas is the only thing
the user pays for. This is fine because the gas price can fluctuate in a
real transaction. This also removes the inconsistency of sometimes
having to bother deducting the amount if it is less than the sender's
balance, thereby causing a bug/inconsistency (erigontech#11813).
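
A self-contained sketch of the new behaviour under `gasBailout`, with deliberately simplified types (illustrative, not Erigon's actual gas-buying code):

```
package main

import (
	"errors"
	"fmt"
	"math/big"
)

var errInsufficientFunds = errors.New("insufficient funds to buy gas")

// buyGas sketches the new behaviour: when gasBailout is set, the sender's
// balance is neither checked nor debited for the gas value amount.
func buyGas(balance, gasPrice *big.Int, gasLimit uint64, gasBailout bool) (*big.Int, error) {
	if gasBailout {
		// New behaviour: skip both the balance check and the deduction.
		return balance, nil
	}
	gasVal := new(big.Int).Mul(gasPrice, new(big.Int).SetUint64(gasLimit))
	if balance.Cmp(gasVal) < 0 {
		return nil, errInsufficientFunds
	}
	return new(big.Int).Sub(balance, gasVal), nil
}

func main() {
	bal, err := buyGas(big.NewInt(1_000_000), big.NewInt(10), 21000, false)
	fmt.Println(bal, err) // 790000 <nil>

	bal, err = buyGas(big.NewInt(100), big.NewInt(10), 21000, true)
	fmt.Println(bal, err) // 100 <nil> - balance untouched under bailout
}
```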
…ch#11867)

fixes erigontech#11818

issue was:
- when at tip we receive new block hashes and new block events
- we had an if statement which checked if the canonical chain builder
tip changed after connecting new headers to the tree
- that if statement was used to determine whether we should call
`InsertBlocks` for the blocks we've just connected, and also whether to
`commitExecution` (call `UpdateForkChoice`)
- this meant that when at the tip, we would not insert new blocks which
would not change the tip of the canonical chain builder
- this is wrong because we should be inserting these blocks as they may
end up being on the canonical path several blocks later in case the
forks change in their favour based on the connected ancestors

fix is:
- augment `canonicalChainBuilder.Connect` to return the newly connected
headers to the tree
- always insert newly connected headers (upon successful connection to
the root)
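
A compilable sketch of the fixed flow; the stub types below are illustrative stand-ins for Erigon's polygon sync types:

```
package polygonsync

import "context"

type Header struct{ Number uint64 }

type canonicalChainBuilder interface {
	// Connect now also returns the headers newly connected to the tree.
	Connect(ctx context.Context, headers []*Header) ([]*Header, error)
	Tip() *Header
}

type executionClient interface {
	InsertBlocks(ctx context.Context, headers []*Header) error
}

// onNewHeaders sketches the fix: newly connected headers are always inserted,
// while the fork choice update still only happens on a tip change.
func onNewHeaders(ctx context.Context, ccb canonicalChainBuilder, exec executionClient, headers []*Header) (tipChanged bool, err error) {
	oldTip := ccb.Tip()
	newConnected, err := ccb.Connect(ctx, headers)
	if err != nil {
		return false, err
	}
	// Insert even when the tip is unchanged - these blocks may become
	// canonical later if the fork changes in their favour.
	if len(newConnected) > 0 {
		if err := exec.InsertBlocks(ctx, newConnected); err != nil {
			return false, err
		}
	}
	return ccb.Tip() != oldTip, nil
}
```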
)

The next chain tip error was caught and fixed for the astrid stage integration:
```
append with gap blockNum=11561329, but current height=11561327
```

It happens after an unwind due to a fork change in the corresponding fork
choice update.

This is due to a bug in the logic of handling fork choice updates in the
stage integration. The issue is that when processing the
`cachedForkChoice` after we have done the unwind, `fixCanonicalChain`
returns empty `newNodes` (correctly, since the chain was fixed before we
cached the fork choice). The solution is to cache the new nodes as the
`cachedForkChoice` so that when we process the cached fork choice in the
next iteration we can correctly update the tx nums for the new nodes.
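
In sketch form (type and field names are assumptions, not the stage's actual ones):

```
package stages

// Illustrative sketch of the fix: when a fork change forces an unwind, the
// fork choice is cached together with the chain nodes computed before the
// unwind, so the next stage iteration can update tx nums for them even
// though fixCanonicalChain will (correctly) return no new nodes afterwards.
type chainNode struct {
	blockNum uint64
	hash     [32]byte
}

type cachedForkChoice struct {
	tipBlockNum uint64
	tipHash     [32]byte
	newNodes    []chainNode // captured before unwinding
}
```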

Full logs:
```
INFO[09-04|16:14:31.018] [2/6 PolygonSync] update fork choice     block=11561328 age=0 hash=0x41ebb5e01406c1f013f06ee4e53ab68b125f071717d50bcdcfa4597a0a052cfe
INFO[09-04|16:14:31.019] [2/6 PolygonSync] new fork - unwinding and caching fork choice 
DBUG[09-04|16:14:31.021] UnwindTo                                 block=11561327 block_hash=0xc8ba20e1e4dc312bda4aadc5108722205693783b1c2d6103cb70949bda58a460 err=nil stack="[sync.go:171 stage_polygon_sync.go:1391 stage_polygon_sync.go:1356 stage_polygon_sync.go:1478 stage_polygon_sync.go:501 stage_polygon_sync.go:175 default_stages.go:479 sync.go:531 sync.go:410 stageloop.go:249 stageloop.go:101 asm_arm64.s:1222]"
DBUG[09-04|16:14:31.021] [2/6 PolygonSync] DONE                   in=5.45216175s
DBUG[09-04|16:14:31.021] [1/6 OtterSync] DONE                     in=21.167µs
INFO[09-04|16:14:31.021] [2/6 PolygonSync] forward                progress=11561327
INFO[09-04|16:14:31.021] [2/6 PolygonSync] new fork - processing cached fork choice after unwind 
INFO[09-04|16:14:31.022] [2/6 PolygonSync] update fork choice     block=11561328 age=0 hash=0x41ebb5e01406c1f013f06ee4e53ab68b125f071717d50bcdcfa4597a0a052cfe
DBUG[09-04|16:14:31.022] [2/6 PolygonSync] DONE                   in=186.792µs
DBUG[09-04|16:14:31.022] [3/6 Senders] DONE                       in=236.458µs
INFO[09-04|16:14:31.024] [4/6 Execution] Done Commit every block  blk=11561327 blks=1 blk/s=1125.7 txs=2 tx/s=2.25k gas/s=0 buf=0B/512.0MB stepsInDB=0.00 step=24.3 alloc=600.4MB sys=1.7GB
DBUG[09-04|16:14:31.024] [4/6 Execution] DONE                     in=2.020375ms
DBUG[09-04|16:14:31.024] [5/6 TxLookup] DONE                      in=74.292µs
DBUG[09-04|16:14:31.024] [6/6 Finish] DONE                        in=2.958µs
INFO[09-04|16:14:31.024] Timings (slower than 50ms)               PolygonSync=5.452s alloc=600.5MB sys=1.7GB
DBUG[09-04|16:14:31.025] [6/6 Finish] Prune done                  in=5.625µs
DBUG[09-04|16:14:31.025] [5/6 TxLookup] Prune done                in=237.084µs
DBUG[09-04|16:14:31.025] [4/6 Execution] Prune done               in=65.958µs
DBUG[09-04|16:14:31.025] [3/6 Senders] Prune done                 in=2.75µs
DBUG[09-04|16:14:31.025] [2/6 PolygonSync] Prune done             in=2.25µs
DBUG[09-04|16:14:31.025] [snapshots] Prune Blocks                 to=11559976 limit=10
DBUG[09-04|16:14:31.026] [snapshots] Prune Bor Blocks             to=11559976 limit=10
DBUG[09-04|16:14:31.026] [1/6 OtterSync] Prune done               in=1.334833ms
DBUG[09-04|16:14:31.154] [1/6 OtterSync] DONE                     in=6.792µs
INFO[09-04|16:14:31.154] [2/6 PolygonSync] forward                progress=11561327
DBUG[09-04|16:14:33.030] [bridge] processing new blocks           from=11561329 to=11561329 lastProcessedBlockNum=11561328 lastProcessedBlockTime=1725462871 lastProcessedEventID=2688
DBUG[09-04|16:14:33.030] [sync] inserted blocks                   len=1 duration=1.184125ms
DBUG[09-04|16:14:33.030] [bor.heimdall] synchronizing spans...    blockNum=11561329
DBUG[09-04|16:14:33.031] [bridge] synchronizing events...         blockNum=11561329 lastProcessedBlockNum=11561328
INFO[09-04|16:14:33.031] [2/6 PolygonSync] update fork choice     block=11561329 age=0 hash=0x298f72d6fbbfdc8d3df098828867dea7e8e7bba787c1eb17f6c6025afa9ac3d1
WARN[09-04|16:14:33.032] [bor.heimdall] an error while fetching   path=bor/latest-span queryParams= attempt=1 err="Get \"https://heimdall-api-amoy.polygon.technology/bor/latest-span\": context canceled"
DBUG[09-04|16:14:33.032] [bor.heimdall] request canceled          reason="context canceled" path=bor/latest-span queryParams= attempt=1
EROR[09-04|16:14:36.032] [2/6 PolygonSync] stopping node          err="append with gap blockNum=11561329, but current height=11561327, stack: [txnum.go:149 accessors_chain.go:703 stage_polygon_sync.go:1398 stage_polygon_sync.go:1356 stage_polygon_sync.go:1478 stage_polygon_sync.go:501 stage_polygon_sync.go:175 default_stages.go:479 sync.go:531 sync.go:410 stageloop.go:249 stageloop.go:101 asm_arm64.s:1222]"
DBUG[09-04|16:14:36.032] Error while executing stage              err="[2/6 PolygonSync] stopped: append with gap blockNum=11561329, but current height=11561327, stack: [txnum.go:149 accessors_chain.go:703 stage_polygon_sync.go:1398 stage_polygon_sync.go:1356 stage_polygon_sync.go:1478 stage_polygon_sync.go:501 stage_polygon_sync.go:175 default_stages.go:479 sync.go:531 sync.go:410 stageloop.go:249 stageloop.go:101 asm_arm64.s:1222]"
DBUG[09-04|16:14:36.033] rpcdaemon: the subscription to pending blocks channel was closed 
```
Should fix erigontech#11748 and erigontech#11670

---------

Co-authored-by: Mark Holt <mark@distributed.vision>
Added a notifier which notifies that torrent downloading has completed.

---------

Co-authored-by: Mark Holt <mark@distributed.vision>
…igontech#11722)

As the value for each `to` address is not used, keep the same logic for
`froms` and `tos`.

---------

Signed-off-by: jsvisa <delweng@gmail.com>