Meta Issue: Fixing high impact correctness and performance problems in ETH RPC API for snapshot synced nodes #12293

aarshkshah1992 · 2024-07-24T14:21:54Z

This issue aims to be a meta-issue to capture and track work that needs to be done to enhance correctness, performance, and stability of the ETH RPC API on snapshot synced nodes. Note that improving performance for ETH RPC API on archival nodes is out of scope for this issue and will be addressed by a future issue.

Our goal is to improve the developer experience (DX) for key partners, including:

ETH Subgraph Providers (e.g., Vulcanize)
EVM Explorers (e.g., Blockscout)
Cross-chain Bridges (e.g., Axelar)
DApp Developers using existing ETH tooling which relies on ETH RPC API

Correctness and data availability issues in the chain state Indexes used by the ETH RPC API

Currently, we maintain three primary indices on the chain state, which are essential for both correctness and performance of multiple ETH RPC APIs.

Transaction Index

Maps an ETH transaction hash to a Filecoin message CID.
This mapping is one-way; if it's not in the index, there's no way to go from an ETH transaction hash to a Filecoin message CID.
APIs depend on this index to look up the Filecoin message CID for a given ETH transaction hash, and then look up the message by CID to get the actual transaction data.
If an entry is not found in this Index for a given transaction hash, the subsequent lookup for the message will be slow and likely fail if the query pertains to an ETH transaction.

Message Index

Maps a message ("transaction") to its tipset and block number.
Heavily used by ETH RPC to look up the tipset containing a given message CID to get the full message/receipts.
A miss on this Index leads to a painfully slow search through the state store for the requested message
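The two-step lookup described above (ETH transaction hash to message CID, then message CID to tipset) can be sketched as follows. This is a minimal illustration, not the Lotus implementation: `TxIndex`, `MsgIndex`, `MsgLocation`, `LocateTx` and `ErrIndexMiss` are hypothetical names, and plain maps stand in for the SQLite-backed indices. The point is that a miss in either index is surfaced explicitly so the caller can decide whether to take the slow path.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrIndexMiss signals that an index has no entry for the requested key.
var ErrIndexMiss = errors.New("index miss")

// TxIndex is a hypothetical stand-in: ETH tx hash -> Filecoin message CID.
type TxIndex map[string]string

// MsgLocation is where a message landed on chain.
type MsgLocation struct {
	TipsetKey string
	Epoch     int64
}

// MsgIndex is a hypothetical stand-in: message CID -> containing tipset.
type MsgIndex map[string]MsgLocation

// LocateTx resolves an ETH transaction hash to the tipset that contains it.
// It reports which index missed instead of silently falling back to a scan.
func LocateTx(tx TxIndex, msgs MsgIndex, ethHash string) (MsgLocation, error) {
	cid, ok := tx[ethHash]
	if !ok {
		return MsgLocation{}, fmt.Errorf("tx index, hash %s: %w", ethHash, ErrIndexMiss)
	}
	loc, ok := msgs[cid]
	if !ok {
		return MsgLocation{}, fmt.Errorf("message index, cid %s: %w", cid, ErrIndexMiss)
	}
	return loc, nil
}

func main() {
	tx := TxIndex{"0xabc": "bafy-msg-1"}
	msgs := MsgIndex{"bafy-msg-1": {TipsetKey: "ts-100", Epoch: 100}}

	loc, err := LocateTx(tx, msgs, "0xabc")
	fmt.Println(loc.Epoch, err)

	_, err = LocateTx(tx, msgs, "0xdef")
	fmt.Println(errors.Is(err, ErrIndexMiss))
}
```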

Event Index

Associates each tipset/block with the event logs for all transactions contained within.
This index is crucial for the functionality of the ETH Event APIs, which are extensively utilized by subgraph providers and bridges.
Issues with data availability in this index can lead to costly recomputation of tipsets to regenerate events, or to the Events API failing to return events that should be available. As long as the corresponding tipset is present in the chain state, its events should be reliably indexed.

All of the above indices suffer from some or all of the following problems that need to be fixed:

They're not reset and rehydrated when a node syncs from a snapshot
They're not automatically backfilled on node startup
The lotus-shed backfilling CLI that users rely on for manually backfilling the indices is broken: all the indices are persisted in SQLite, and SQLite only supports a single writer. This effectively means that backfilling races with indexing of new/ongoing state transitions.
Indexing is done asynchronously to tipset/message execution, but APIs that rely on these indices do not account for the async nature of indexing, which leads to racy data availability issues for lookups at/near the chain head.
They are not subject to garbage collection when the Splitstore is GC'd, which means they retain data that will never be accessed and the indices keep growing in size. Although the space used by these indices is minimal compared to the overall chainstate, the accumulation of unnecessary data could potentially slow down index queries.
Despite being operational for over a year, features like the Message Index are still labeled as "EXPERIMENTAL", which causes confusion among node operators.
They lack adequate logging and telemetry around index usage and backfilling.
They lack consistency checks and auto-repair mechanisms that would 1) flag missing data in the indices for epochs that have achieved finality and 2) lazily backfill missing data in the indices on user requests.
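As an illustration of the consistency check in the last item, here is a rough sketch of flagging finalized epochs that have a tipset on chain but no index entry. Everything here is an assumption for illustration: `MissingEpochs` and the two callback parameters are hypothetical, and the 900-epoch finality window is the expected-consensus value. Null rounds (epochs with no tipset) are deliberately not reported as holes.

```go
package main

import "fmt"

// Finality is the lookback, in epochs, after which a tipset is treated as final.
const Finality = 900

// MissingEpochs returns the epochs in [from, head-Finality] that have a tipset
// on chain but no entry in the index. hasTipset reports whether a non-null
// tipset exists at an epoch; indexed reports whether the index covers it.
func MissingEpochs(from, head int64, hasTipset, indexed func(int64) bool) []int64 {
	var missing []int64
	for e := from; e <= head-Finality; e++ {
		if hasTipset(e) && !indexed(e) {
			missing = append(missing, e)
		}
	}
	return missing
}

func main() {
	// Epoch 104 is a null round; epoch 102 was never indexed.
	hasTipset := func(e int64) bool { return e != 104 }
	indexed := func(e int64) bool { return e != 102 && e != 104 }

	fmt.Println(MissingEpochs(100, 1005, hasTipset, indexed))
}
```

The returned epochs could feed either a startup report or a lazy backfill triggered by a user request that touches one of them.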

Correctness problems in the ETH Events API

Events: TOCTOU Race when subscribing to new events #12111 - Event Filter APIs have raciness that can return incorrect results.
The block hash does not match #10911 - Mismatch between the block hash returned by ETH Get Block API and the block hash returned by the ETH Events API. This one could have been caused by a re-org but a solid itest to verify that this is no longer a problem would be great.
eth_getFilterChanges returns "filter not found" #11589 AND Reject Eth subscriptions & filters through the gateway over HTTP #11153 Event Filter APIs should work with the HTTP Gateway as expected by ETH tooling.
Eth RPC: EthGetLogs should return explicit error when queried for not-existing block hash on all providers #10940 - eth_getLogs should differentiate between "processed the block and it has no events" and "never seen this block". We already have the required scaffolding and metadata for this in place but need to fix the error handling here and write some solid tests.
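To make the distinction in #10940 concrete, here is a small sketch of the two outcomes. It is not the Lotus event index: `EventStore`, `MarkProcessed`, `LogsForBlock` and `ErrBlockNotSeen` are hypothetical names. A processed block with no events yields an empty result and no error, while an unknown block hash yields an explicit error.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBlockNotSeen means the block was never processed by the event index.
var ErrBlockNotSeen = errors.New("block not seen by event index")

// EventStore is a hypothetical stand-in that records every processed block,
// including blocks that produced no events.
type EventStore struct {
	processed map[string][]string // block hash -> event logs (possibly empty)
}

func NewEventStore() *EventStore {
	return &EventStore{processed: make(map[string][]string)}
}

// MarkProcessed records a block as seen, even when logs is empty.
func (s *EventStore) MarkProcessed(blockHash string, logs []string) {
	s.processed[blockHash] = append([]string{}, logs...)
}

// LogsForBlock returns an empty slice for a processed block without events
// and ErrBlockNotSeen for a block hash the store has never processed.
func (s *EventStore) LogsForBlock(blockHash string) ([]string, error) {
	logs, seen := s.processed[blockHash]
	if !seen {
		return nil, ErrBlockNotSeen
	}
	return logs, nil
}

func main() {
	s := NewEventStore()
	s.MarkProcessed("0xaaa", nil)

	logs, err := s.LogsForBlock("0xaaa")
	fmt.Println(len(logs), err)

	_, err = s.LogsForBlock("0xbbb")
	fmt.Println(errors.Is(err, ErrBlockNotSeen))
}
```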

In-memory block caching for perf improvements

Multiple ETH RPC APIs frequently need to look up Filecoin tipsets and convert them to the corresponding Ethereum block representations. These lookups are performed on the chainstore, which is expensive. We should cache these tipsets/blocks in an LRU cache. See #10520.
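For readers unfamiliar with the idea, a minimal LRU block cache might look like the sketch below. It uses only the Go standard library and is not the cache Lotus uses; `BlockCache` and `EthBlock` are hypothetical names, and `EthBlock` is a placeholder for the converted Ethereum block representation.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// EthBlock is a placeholder for the converted Ethereum block representation.
type EthBlock struct {
	Hash   string
	Number int64
}

type entry struct {
	key   string
	block EthBlock
}

// BlockCache is a small, mutex-guarded LRU cache keyed by block hash.
type BlockCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func NewBlockCache(capacity int) *BlockCache {
	return &BlockCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached block and marks it as most recently used.
func (c *BlockCache) Get(key string) (EthBlock, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return EthBlock{}, false
	}
	c.order.MoveToFront(el)
	return el.Value.(entry).block, true
}

// Put inserts or updates a block and evicts the least recently used entry
// once the cache grows past its capacity.
func (c *BlockCache) Put(key string, b EthBlock) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value = entry{key: key, block: b}
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(entry{key: key, block: b})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(entry).key)
	}
}

func main() {
	cache := NewBlockCache(2)
	cache.Put("0xa", EthBlock{Hash: "0xa", Number: 1})
	cache.Put("0xb", EthBlock{Hash: "0xb", Number: 2})
	cache.Get("0xa")                                   // touch 0xa so 0xb becomes the eviction candidate
	cache.Put("0xc", EthBlock{Hash: "0xc", Number: 3}) // evicts 0xb

	_, ok := cache.Get("0xb")
	fmt.Println(ok)
}
```

Keying by block hash rather than by height sidesteps stale entries after a re-org, since a re-orged height resolves to a different hash.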

Miscellaneous correctness bugs from the backlog

getBlock return an Error #10909 - eth_getBlock does not conform to the ETH RPC spec for Filecoin null rounds (null rounds are a quirk in Filecoin and need to be handled correctly here).
Eth API: Trace installed bytecode in the Ethereum JSON-RPC trace_block output #11635 - The ETH Trace API currently fails to include the byte code of the deployed smart contract in the trace output for transactions that deploy smart contracts. IIRC, Blockscout really needed this to be able to show the contract byte code on their explorer.
EthGetTransactionCount ignores input #10357 - Correctness bug in eth_getTransactionCount.


snissn · 2024-07-24T21:34:40Z

For any and all of these that have ways to reproduce, it would be great to coordinate and add the RPC call behind each issue to the RPC benchmark tool that Fil-B has been maintaining - https://github.com/fil-builders/benchmark-rpc/blob/main/pages/index.js#L21

For example, issue #10940 should be easy to reproduce in a live test. I created a ticket (FIL-Builders/benchmark-rpc#1) to add it to the web app http://benchmark-rpc.fil.builders/

rvagg · 2024-07-25T05:45:14Z

They're not reset and rehydrated when a node syncs from a snapshot

msgindex is handled in lotus/cmd/lotus/daemon.go (line 647 in 718fc03):

           if err := index.PopulateAfterSnapshot(ctx, filepath.Join(basePath, index.DefaultDbFilename), cst); err != nil {

It at least has a pattern we can follow for others. But it also overlaps with a backfill operation, so we may end up taking care of snapshot import with a general backfill routine if we get that right.

Stebalien · 2024-07-30T22:54:40Z

This list seems pretty complete. IMO, the highest priority is fixing the indexing issues:

P0: Make sure that we backfill to the last indexed tipset on restart. We should never have "holes".
P1: Make sure that we wait for the index in some cases, or even eagerly force it. E.g., as we discussed on the call, all of the EthGetBlockBy* commands should force the indexer to index that block (and its parents).
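One way to picture the P0 item is a walk back from the current head along parent links until an already-indexed tipset is reached, then indexing the collected tipsets oldest first so no holes are left. The sketch below is only an illustration under simplifying assumptions: `Tipset` has a single parent key, and `TipsetsToBackfill` plus the map-based chain and index are hypothetical.

```go
package main

import "fmt"

// Tipset is a simplified stand-in with a single parent link.
type Tipset struct {
	Key    string
	Epoch  int64
	Parent string // empty for genesis
}

// TipsetsToBackfill walks from head towards genesis and collects every tipset
// that is not yet indexed, stopping at the first indexed one. The result is
// ordered oldest first so indexing can proceed without leaving gaps.
func TipsetsToBackfill(head string, chain map[string]Tipset, indexed map[string]bool) []Tipset {
	var out []Tipset
	for key := head; key != ""; {
		ts, ok := chain[key]
		if !ok || indexed[key] {
			break
		}
		out = append(out, ts)
		key = ts.Parent
	}
	// Reverse in place: the walk produced newest-first order.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	chain := map[string]Tipset{
		"a": {Key: "a", Epoch: 1},
		"b": {Key: "b", Epoch: 2, Parent: "a"},
		"c": {Key: "c", Epoch: 3, Parent: "b"},
		"d": {Key: "d", Epoch: 4, Parent: "c"},
	}
	// The node went down after indexing "b"; the head is now "d".
	indexed := map[string]bool{"a": true, "b": true}

	for _, ts := range TipsetsToBackfill("d", chain, indexed) {
		fmt.Println(ts.Key, ts.Epoch)
	}
}
```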

Indexing is done asynchronously to tipset/message execution, but APIs that rely on these indices do not account for the async nature of indexing, which leads to racy data availability issues for lookups at/near the chain head.

IMO, the best way to handle this case is the dance we discussed on the call:

Check the index. If present, return.
Wait for the index to reach the current head.
Check again.

This will miss uncles, but StateSearchMsg is designed to only find messages on the main chain.

I took a look at how geth handles stuff like this and... they also appear to index asynchronously and handle this case by returning an error if the node is currently indexing a block. That's not a terrible option... but it would be a larger breaking change.

BigLep · 2024-08-06T19:20:54Z

This is a great overview @aarshkshah1992 - thanks for writing it up. A few questions, some of which are coming from a newbie/ignorant-of-the-code perspective. I'm happy to chat on any of these elsewhere or offline, but figured to ask here so it's public.

Can any of the ETH RPC code be simplified when we have fast finality (F3)? If so, given we’re so close to that being live on the network (less than 2 months), should we do work here with that assumption in mind?
Assuming we "make sure that we backfill to the last indexed tipset on restart so we never have holes" and we "make sure that we wait for the index", can we simplify the code by failing fast if we have an index miss? I'm eyeing your statements in the issue description about “the subsequent lookup for the message will be slow and likely fail if the query pertains to an ETH transaction” and “A miss on this Index leads to a painfully slow search through the state store for the requested message”. It's presumably not good to be in this state, but would it be better to fail fast than to do expensive state traversals? (I assume private Lotus RPC methods can be invoked if someone needs the slow path.)
Is there value in having separate DBs? (I don't have insight here - just curious.) I see there is an issue about consolidating in Enable efficient indexing of historical chain data #10807. (Maybe it's best to discuss this question there.)

aarshkshah1992 · 2024-08-07T08:49:34Z

@BigLep

I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.
I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.

Let me look into 2.

Stebalien · 2024-08-08T02:10:54Z

I don't think there's any work here that's bound to a certain notion of finality and so not sure if F3 changes anything here in terms of the work we need to do on ETH RPC/Chain state indexing.

That depends on what we do with the API. If F3 is "fast enough", we could just not expose anything after finality. But... that's probably not going to work well.

I don't think there is any value in having separate DBs but @Stebalien can confirm the team's line of thinking when this was implemented.

I agree there's no reason to keep them separate.

aarshkshah1992 · 2024-08-08T13:26:23Z

@Stebalien @BigLep @rvagg

In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.

Let's get to it once we've fixed all the other problems here.

BigLep · 2024-08-22T17:26:58Z

In the first pass of this work, we're not going to work on merging the DBs for these as that is a larger refactor and will need a non-trivial migration for users and we've not estimated it yet.

Let's get to it once we've fixed all the other problems here.

For visibility, it was decided that it would be useful to merge the DBs into a single DB. The work is happening in #12421

aarshkshah1992 added the kind/feature Kind: Feature label Jul 24, 2024

aarshkshah1992 changed the title ~~Correctness and Performance problems in ETH RPC: Meta Issue~~ Meta Issue: Fixing high impact correctness and performance problems in ETH RPC Jul 24, 2024

aarshkshah1992 changed the title ~~Meta Issue: Fixing high impact correctness and performance problems in ETH RPC~~ Meta Issue: Fixing high impact correctness and performance problems in ETH RPC API for snapshot synced nodes Jul 24, 2024

aarshkshah1992 added the area/eth-api label Jul 24, 2024

snissn mentioned this issue Jul 24, 2024

Add check for EthGetLogs return explicit error when queried for not-existing block hash FIL-Builders/benchmark-rpc#1

Open

rjan90 added this to FilOz Jul 25, 2024

rjan90 moved this to 🐱Todo in FilOz Jul 25, 2024

rjan90 moved this from 🐱Todo to 📌 Triage in FilOz Jul 25, 2024

rjan90 added this to the DX-Streamline milestone Aug 2, 2024

aarshkshah1992 mentioned this issue Aug 8, 2024

feat: ETH RPC: Use Block Cache for EthGetBlockByHash #12359

Merged

aarshkshah1992 mentioned this issue Aug 16, 2024

[WIP] A new ChainIndexer that replaces the fragmented MsgIndex, EthTxHashIndex and EventsIndex #12388

Closed

BigLep mentioned this issue Aug 16, 2024

feat(badger): add support for Badger version 4. The default remains Badger version 2 to ensure backward compatibility. #12316

Open

13 tasks

rjan90 mentioned this issue Aug 21, 2024

eth_getTransactionReceipt fails to find transaction #11583

Open

11 tasks

rjan90 moved this from 📌 Triage to ⌨️In Progress in FilOz Aug 21, 2024

akaladarshi mentioned this issue Aug 21, 2024

Implement a lotus-shed migration command to migrate existing indexes to chain indexer #12408

Closed

9 tasks

aarshkshah1992 mentioned this issue Sep 12, 2024

A new ChainIndexer that subsumes that existing MsgIndex, EventIndex and TransactionIndex #12453

Open
