Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eth_sendRawTransaction: remove logging, because we have --txpool.trace.senders which enabling more verbose logging for given senders list #7686

Merged
merged 2 commits into from
Jun 11, 2023

Conversation

mh0lt
Copy link
Contributor

@mh0lt mh0lt commented Jun 8, 2023

When testing with Bor consensus turned on I discovered that SendRawTransaction returns a 0x000... hash when transactions are submitted during block transitions. This turns out to be spurious in the sense that the transaction insertion is successful.

The cause is that ReadCurrentBlockNumber returns a nil block number.

This in turn is caused by the following: In accessors_chain.go there are two methods: WriteHeader and WriteHeadHeaderHash when the first is called the block number is written for the header. The second writes the header has, but there is no guarantee when it does that the head header will have been written yet. In fact it seems to happen sometime later.

The problem for SendRawTransation is that it begins a transaction after inserting into the txpool. And depending on timing this transaction may see only the WriteHeadHeaderHash insertion, and hence can't read the block number.

I have mitigated this by opening the db transaction before calling the tx pool insertion, meaning that it is more likely to have a clean view of the DB.

I have also moved the chain id check earlier in the code - as I think that if this is invalid the method should not try to insert transactions in the first place.

The ReadCurrentBlockNumber is only used to produce a log message - so I've changed this to not fail the whole function but to just log an unknown sender. Which means that the hash is still returned to the sender after a successful txpool insertion

@AskAlexSharov
Copy link
Collaborator

AskAlexSharov commented Jun 8, 2023

ReadCurrentBlockNumber may fail during block transitions. The root cause is the fact that rawdb.WriteHeadHeaderHash is called
// prior WriteHeader - which can leave the DB in a state where the header hash has no block number mapping.

no, it’s not true. We using db transactions and “all or nothing” will be committed at once. DB is never in this state: except initialSync - which is fine.

In the past Erigon also had corner-case at firstSyncCycle (after start) - but it’s fixed by: #7532

so, on tip of chain now - all block changes are done in 1 ACID transaction.

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 8, 2023

So that change is in the codebase - I presume.

If that is the case I think that I may have a different issue - because when I run Bor consensus in the devnet test I am doing I definitely see it several times during the test run of 200 transactions and it is not necessarily in the first block.

I put some local logging in to test and could see the transactions process interleaved.

If I revert my change - which I have locally I can do a run and give you more detail as to how it happens - as I get it almost every time I run.

I think that the WriteHeadHeaderHash originates in startHandlingForkChoice

@AskAlexSharov
Copy link
Collaborator

AskAlexSharov commented Jun 8, 2023

@mh0lt startHandlingForkChoice is part of "staged sync" and rw transaction is managed in StageLoopStep (which run all stages - including unwind - in 1 rwtx).

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 8, 2023

Apologies - I scrolled up too far in the code. I think its more likely: HeadersPOW. I'll reproduce the test and confirm. Will be in a couple of hours as I'm just looking at a transaction propagation issue.

@AskAlexSharov
Copy link
Collaborator

thank you

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 8, 2023

Attached is the debug location of the write location and the full run log:

accessors_chain
read_header_number.log

The pattern which is repeated is as follows:

You can see that the read gets a null back, and that there is an interspersed write operation, with a read which is working on a previous context, I don't think the print order is particularly significant (I used println rather than a log) as its enough to see what is going on.

There didn't seem to be a clear way of printing a transaction id so I've just used the address. Which may be not correct - but it seems like all of the prints are using a different transaction (though this could be wrong)

ReadHeaderNumber 0xdb39c298f4894f4e6b1ca8659f0055a48512d6f466b486723d6291cb38023eb7 1 0xc0010bf780
[INFO] [06-08|20:59:58.166] local tx propagated                      tx_hash=13b895e5945382d0de0014a75d2ee520eabbb6c28cc896768d55c4eb3413384a announced to peers=1 broadcast to peers=1 baseFee=1000000000
[INFO] [06-08|20:59:58.168] Submitted transaction                    hash=0x13b895e5945382d0de0014a75d2ee520eabbb6c28cc896768d55c4eb3413384a from=0x67b1d87101671b127f5f8714789C7192f7ad340e nonce=35 recipient=0x71562b71999873DB5b286dF957af199Ec94617F7 value=0x2d1f
[INFO] [06-08|20:59:58.171] http://localhost:8545#eth_sendRawTransaction time=0.008
[INFO] [06-08|20:59:58.171] [2/15 Headers] Processed                 highest inserted=2 age=12s
[DBUG] [06-08|20:59:58.171] [2/15 Headers] DONE                      in=13.158611s
ReadHeaderNumber 0x0d9b4685d3e20a6e319f8a9afbba073bcc390b8eba044718767349be19dd297d <nil> 0xc0010bfbc0
[INFO] [06-08|21:00:10.595] local tx propagated                      tx_hash=80cf4ea8a0d9d0c7d490914ea4483827ab174476a26c440d0ff5945acc911018 announced to peers=1 broadcast to peers=1 baseFee=1000000000
[INFO] [06-08|21:00:10.601] http://localhost:8545#eth_sendRawTransaction time=12.430
[EROR] [06-08|21:00:10.601] failed SendTransaction                   error="Request: 42, hash: 0x80cf4ea8a0d9d0c7d490914ea4483827ab174476a26c440d0ff5945acc911018, nonce  36: returned a zero transaction hash"
[DBUG] [06-08|21:00:10.602] Start mining new block based on miner.recommit
[DBUG] [06-08|21:00:10.605] [3/15 CumulativeIndex] DONE              in=12.4338955s
WriteHeaderRaw 0x0d9b4685d3e20a6e319f8a9afbba073bcc390b8eba044718767349be19dd297d 2 0xc001887200
WriteHeadHeaderHash 0x0d9b4685d3e20a6e319f8a9afbba073bcc390b8eba044718767349be19dd297d 0xc001887230
[DBUG] [06-08|21:01:36.622] Start mining new block based on miner.recommit
[INFO] [06-08|21:01:36.623] [4/15 BlockHashes] DONE                  in=1m26.0163087s
[INFO] [06-08|21:01:36.644] [1/5 MiningCreateBlock] Start mine       block=2 baseFee=878906250 gasLimit=10019538
ReadHeaderNumber 0xdb39c298f4894f4e6b1ca8659f0055a48512d6f466b486723d6291cb38023eb7 1 0xc0004f41a0
[DBUG] [06-08|21:01:36.645] [1/5 MiningCreateBlock] DONE             in=1.8474ms
[DBUG] [06-08|21:01:36.647] [5/15 Bodies] DONE                       in=23.438ms
[DBUG] [06-08|21:01:47.123] Start mining new block based on miner.recommit
[DBUG] [06-08|21:01:47.126] [txpool] Commit                          written_kb=24 in=1m36.5271432s
[INFO] [06-08|21:01:47.129] [txpool] stat                            pending=0 baseFee=37 queued=0 alloc=29.9MB sys=55.7MB
[DBUG] [06-08|21:01:47.129] [txpool] Commit                          written_kb=4 in=0s
[INFO] [06-08|21:01:47.130] [p2p] GoodPeers                          eth68=1
ReadHeaderNumber 0x0d9b4685d3e20a6e319f8a9afbba073bcc390b8eba044718767349be19dd297d 2 0xc0011e40e0

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 8, 2023

All of this goes away with my change above - although clearly it is avoiding the issue rather than fixing it.

@AskAlexSharov
Copy link
Collaborator

AskAlexSharov commented Jun 9, 2023

You can see that the read gets a null back - i don't see - where I need look? at which log line?
sorry, don't understand the root of problem yet.

On your screenshot WriteHeadHeaderHash accepting tx as a first parameter - it's write-db-transaction, so RPC handlers must not see "partial writes". If it's somehow happening - we must guarantee "no partial writes visibility" on writer side - rather than do workarounds on reader side (be cause we have hundreds of RPC methods and dozens of db tables - can't workaround all this cases).

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 9, 2023

Its on line 7 of the log:

ReadHeaderNumber 0x0d9b4685d3e20a6e319f8a9afbba073bcc390b8eba044718767349be19dd297d 0xc0010bfbc0

The following lines show the error being propagated to the client.

I refactored the code because I think there are actually 2 issues here:

  1. In SendRawTransaction the transaction is actually inserted into the TX pool ok, the error is generated as a side effect of producing the log message. I don't think that an error in logging should cause a change in the return from this function. It actually implies a fail when the transaction insertion was a success. This could lead the client to tryiung again which will create a duplicate transaction.
  2. The partial write issue which you have described.

I agreed with Alexy that for moment to just check in the first fix as it unblocks my testing. The second I think needs more investigation and possibly a greater understanding of the flow than I have at the moment.

I'm happy to look at it if you think I should - however I'd like to of it after checking in some other devnet enhancements and bor fixes which I have in my dev environment which are dependent on SendRawTransaction working consistently - which this fix enables.

@AskAlexSharov
Copy link
Collaborator

  1. Let’s remove all non-required code. Logging can be less verbose - it’s fine.
  2. ok, let’s unblock you somehow. If you can create some script - to reproduce problem or teach me - i think it will be important - because if it’s true it may be reason of other bugs (I don’t remember any similar bugs).

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 9, 2023

I've removed the logging code in the latest check in.

Once I have the bor flow I'm working on I'll create a flow which reproduces the issue. I may be able to suggest a fix myself if not I'll check in a version of the devnet with a scenario which can be debugged.

It will probably be early next week.

@mh0lt
Copy link
Contributor Author

mh0lt commented Jun 10, 2023

@AskAlexSharov are you ok for me to promote this to ready for review ?

Copy link
Collaborator

@AskAlexSharov AskAlexSharov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's ok to remove log line totally. Because we have --txpool.trace.senders flag - which enabling tracing of txs from given senders inside TxPool - which is more useful.

@AskAlexSharov AskAlexSharov changed the title Refactor SendRawTransaction to mitigate ReadCurrentBlockNumber nil return eth_sendRawTransaction: avoid reading blocks. remove existing logging, because have --txpool.trace.senders which enabling more verbose logging Jun 11, 2023
@AskAlexSharov AskAlexSharov changed the title eth_sendRawTransaction: avoid reading blocks. remove existing logging, because have --txpool.trace.senders which enabling more verbose logging eth_sendRawTransaction: remove logging, because we have --txpool.trace.senders which enabling more verbose logging for given senders list Jun 11, 2023
@AskAlexSharov AskAlexSharov marked this pull request as ready for review June 11, 2023 05:53
@AskAlexSharov AskAlexSharov merged commit f9acbff into devel Jun 11, 2023
@AskAlexSharov AskAlexSharov deleted the send_transaction_refactor branch June 11, 2023 06:18
@AskAlexSharov
Copy link
Collaborator

oookey... seems I understand: Released v2.45.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants