
[FEATURE REQUEST] Make DEFAULT_LOWEST_INVALID_KNOWN_NONCE_CACHE configurable, like other transaction pool options #6058

Closed
tzdesing opened this issue Oct 19, 2023 · 21 comments · Fixed by #6148
Labels
non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT TeamChupa GH issues worked on by Chupacabara Team

Comments

@tzdesing

Description

We are running a private, gas-free chain using the QBFT protocol to implement a national CBDC. When a transaction gets stuck in the TX_POOL, it is added to the invalid nonce cache. We are unable to replace it by sending a new transaction with higher gas, so the only workaround is to set TX_POOL_MAX_SIZE to 0 and then back to the default value; after that we can send new transactions again. However, the invalid transaction was included in the validators' cache and we are unable to reset it because of network rules. So every time we send a new transaction, Besu checks this cache and invalidates the transaction again, because the new transaction has a higher nonce than the cached invalid one.

Acceptance Criteria

If we were able to set LOWEST_INVALID_KNOWN_NONCE_CACHE = 0 at bootstrap of the validator nodes, we could transact normally and transactions would not be invalidated every time, because there would be no cache; in our case this cache isn't necessary.
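For illustration only, here is a hypothetical way such an option could be exposed at startup; the flag name below is invented for this sketch and is not an existing Besu option:

```sh
# Hypothetical flag, invented for illustration -- not an existing Besu option.
# The request: let operators disable the invalid known nonce cache at startup.
besu --config-file=config.toml --tx-pool-lowest-invalid-nonce-cache-size=0
```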

@alexcostars

We are facing the same problem. In our case, we have a lot of accounts that are blocked (all transactions signed by these accounts get stuck in the TX_POOL).

One other option is a command line option that disables invalid nonce cache control.

@matthew1001
Contributor

I raised a discussion on Discord about exactly this issue a couple of days ago: https://discord.com/channels/905194001349627914/905205502940696607/1164623046389272699

I was considering a PR that would prevent invalid TXs from entering the invalid nonce cache if the rejection reason was NONCE_TOO_LOW (which is the reason I've seen this happen with). NONCE_TOO_LOW seems to be the least sensible reason for populating the invalid nonce cache, because by definition you couldn't submit a new TX at that nonce anyway.

I'll potentially raise that PR, but I think the new layered pool implementation doesn't use an invalid nonce cache, so you could set --tx-pool=layered and see if that prevents the invalid nonce cache from blocking up the pool.
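A minimal sketch of that suggestion on the command line, assuming a Besu version where --tx-pool=layered is accepted (23.10.0, per the comments below); --config-file=config.toml stands in for whatever configuration you already pass:

```sh
# Select the layered txpool implementation (the default from 23.10.0 onward).
besu --config-file=config.toml --tx-pool=layered
```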

@matthew1001
Contributor

I've raised #6067 which I think could help in a large number of the cases.

@matthew1001
Contributor

Note that I think this issue, which I raised yesterday, is closely related to the scenario described:

#6043

Currently, even if you set --tx-pool-price-bump=0, you cannot replace existing transactions. I think for QBFT, permissioned deployments it is very difficult to manage the pool by setting its max size or its retention time when problematic transactions make their way into the pool.

@alexcostars

@matthew1001 as you mentioned, there is a new layered pool implementation. Has this new implementation already been released?

@matthew1001
Contributor

@matthew1001 as you mentioned, there is a new layered pool implementation. Has this new implementation already been released?

Yes it's available in 23.10.0 as the default pool implementation.

@matthew1001
Contributor

Before 23.10.0 you need to set --Xlayered-tx-pool=true to use it.
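For example, a sketch of a pre-23.10.0 startup (assuming Besu >= 23.4.1, where the layered pool exists as an experimental feature, per the comments below):

```sh
# On Besu >= 23.4.1 and < 23.10.0 the layered pool is an experimental option:
besu --config-file=config.toml --Xlayered-tx-pool=true
```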

@non-fungible-nelson non-fungible-nelson added the non mainnet (private networks) not related to mainnet features - covers privacy, permissioning, IBFT2, QBFT label Oct 23, 2023
@fab-10
Contributor

fab-10 commented Oct 24, 2023

The invalid known nonce cache was introduced as a workaround to mitigate the nonce gap problem while we were introducing the new layered txpool, which solves this problem in a much better way. So if you are running Besu version >= 23.4.1, I strongly recommend using it instead of the legacy one. Since version 23.10.0 it is the default.

That said, we see that there are some specific use cases for private networks that are not supported, and there is an opportunity to create a txpool tailored for them.

@alexcostars

Excellent advice. We will enable this option, but we are using version 23.4.1 and there is no documented option related to --Xlayered-tx-pool=true (https://besu.hyperledger.org/23.4.1/public-networks/reference/cli/options and https://besu.hyperledger.org/23.4.1/private-networks/reference/cli/options), so I suggest including this command line option in the official documentation.

Is there an equivalent ENV variable to set this config?

@fab-10
Contributor

fab-10 commented Oct 24, 2023

It is not documented in that version, because we don't document experimental or hidden features, but you can read about the options directly in PR #5290.

For the env variable, try BESU_XLAYERED_TX_POOL=true
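Besu derives environment variable names from CLI options by adding a BESU_ prefix and upper-casing the option name, so a sketch of that setup (assuming a shell-based launch) would be:

```sh
# Environment-variable equivalent of --Xlayered-tx-pool=true,
# per Besu's BESU_ prefix convention for CLI options.
export BESU_XLAYERED_TX_POOL=true
besu --config-file=config.toml
```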

@fab-10
Contributor

fab-10 commented Oct 25, 2023

FYI: This PR is an improvement to allow for easier tx replacement

@alexcostars

@fab-10,
We recently ran into this issue on one of our addresses (caching invalid nonces).

I understand and agree with your suggestion to enable the parameter (or migrate to 23.10). We are trying to reproduce the error to document and justify the configuration change, but we are not succeeding: when we didn't want to generate the error it occurred, but now that we want to generate it we can't.

Could you help us understand how we can reproduce the nonce gap scenario in version 23.4? Apart from network problems and spam transactions (mentioned in the PR), is there any practical way to cause an address to be included in the invalid nonce cache?

@fab-10
Contributor

fab-10 commented Nov 3, 2023

@alexcostars could you share the configuration and parameters used to start Besu? It would also help to know about your setup: how many nodes, and from which nodes you send the txs.

@alexcostars

Here is the genesis file (https://github.com/bacen/pilotord-kit-onboarding/blob/main/genesis.json) and the config (https://github.com/bacen/pilotord-kit-onboarding/blob/main/config.toml).

We are connected to 20 peers: 14 full nodes and 6 validators.

@siladu siladu added the TeamChupa GH issues worked on by Chupacabara Team label Nov 6, 2023
@alexcostars

Hi, 2 weeks ago we faced this error, and the solution was to restart all validator nodes (because the invalid nonce cache is held in memory).

Today the error occurred again. We now have a blocked account (possibly flagged in the validators' invalid nonce cache). Is there anything I can do to understand why this is occurring?

@fab-10
Contributor

fab-10 commented Nov 8, 2023

@alexcostars unfortunately with your version it is not possible to do much; the only thing that comes to mind is to check the metrics, but even there there are limits.

About your request to reproduce the issue: if you were on a gas-priced network, I would have suggested playing with the balances of the account to simulate it, but since you are on a gas-free network it is not easy to simulate the issue manually, because when the invalid reason is NONCE_TOO_LOW it depends on the timing of events.

NONCE_TOO_LOW is triggered when a validator is trying to create a block and selects a tx with a nonce that is already present in the world state for that sender (namely, already confirmed). This can happen when some events overlap, as detailed below:

Normal flow

  1. Tx1 and Tx2 for Sender1 are sent (Sender1 nonce=0)
  2. Block1 with only Tx1 is created
  3. Block1 with Tx1 is imported -> nonce increased (Sender1 nonce=1)
  4. Tx1 is now confirmed and removed from the pool
  5. Block2 with Tx2 is produced

The issue is when 5 happens before 4, that is, when a new block is being built before the removal of confirmed txs from the pool is complete, as below:

  1. Tx1 and Tx2 for Sender1 are sent (Sender1 nonce=0)
  2. Block1 with only Tx1 is created
  3. Block1 with Tx1 is imported -> nonce increased (Sender1 nonce=1)
    4a. Block2 creation starts with Tx1 and Tx2 still in the txpool; Tx1 is invalid since it is already present in the state, so it is flagged with NONCE_TOO_LOW and inserted in the invalid nonce cache. As a consequence, Tx2 is also removed, since it has an invalid tx with a lower nonce. Block2 is then created with no tx from that sender.
    5a. Tx1 is now confirmed and removed from the pool

So this case is hard to reproduce in a deterministic way; at some point you should upgrade to a newer version where, with the help of @matthew1001, we are improving on this part.

@alexcostars

Thanks for describing the scenario; we understand that it's hard to reproduce.

I have 2 questions:

  1. In step 4a, when Besu creates Block2 and finds that Tx1 is already present in the state, why does the validator not discard Block2 locally and trigger a new block production (what is the gain of including this account in the invalid nonce cache)? Or has Block2 already been broadcast?

  2. If this scenario occurs on a validator that has --Xlayered-tx-pool=true set, what happens?

@fab-10
Contributor

fab-10 commented Nov 8, 2023

@matthew1001 found another race condition, which results in a "chain head not available" error, that he is tracking here, and it could also be a cause of your errors.

About your questions:

  1. Block2 is created because it selects txs from the content the txpool has at that moment; the fact that Tx1 is not accepted has no effect on block building, just as for any other tx
  2. This scenario could also happen with the layered txpool, but the layered txpool does not have the invalid nonce cache, so you should be able to send other txs from the same sender

@rsarres

rsarres commented Nov 8, 2023

@fab-10
Please evaluate my understanding of the issue:
1- The error LOWER_NONCE_INVALID_TRANSACTION_EXISTS means that the node is rejecting the tx from entering its txpool because it thinks there is an already-confirmed tx with a higher nonce on the blockchain.
1a- If the above understanding is correct, the text of the error is misleading. I personally understood it as "an invalid transaction with a lower nonce exists on the blockchain", but I couldn't understand how an invalid transaction got onto the blockchain; after all, even reverted txs are valid txs.

2- There is a bug in the Besu code that is misaligning the nonce cache and the actual blockchain data/world state. Somehow the nonce cache gets higher than it should, and txs with the correct nonces are considered invalid (nonce already used).
2a- If the above is correct, I do not see how the inversion of steps 4 and 5 of the normal flow could create a misalignment of the nonce cache. Block2 may be created in step 4a, but it should never be imported, because it contains a replayed tx.

@fab-10
Contributor

fab-10 commented Nov 9, 2023

@rsarres

  1. No: LOWER_NONCE_INVALID_TRANSACTION_EXISTS means that a tx with a lower nonce was considered invalid, for any reason, not only nonce too low. So the sequence of events is like this:

    • tx1 with nonce 1 is invalid, for any error including NONCE_TOO_LOW, and it is added to the invalid cache
    • tx2 with nonce 2 arrives, but since tx1 is present in the invalid cache it is refused with LOWER_NONCE_INVALID_TRANSACTION_EXISTS
  2. I have elaborated point 4a, as below, since it was not clear:

4a. Block2 creation starts with Tx1 and Tx2 still in the txpool; Tx1 is invalid since it is already present in the state, so it is flagged with NONCE_TOO_LOW and inserted in the invalid nonce cache. As a consequence, Tx2 is also removed, since it has an invalid tx with a lower nonce. Block2 is then created with no tx from that sender.

That said, I reckon the invalid nonce cache was a quick workaround, specifically crafted for public networks to avoid the txpool being polluted with spam txs, while we were developing the layered txpool, which is built to solve this issue at the core.
Now that we have the layered txpool for public networks, and this invalid nonce cache is creating many problems in private networks, I think it is the right time to get rid of this workaround and remove the invalid nonce cache. I will propose a PR for that.

@rsarres

rsarres commented Nov 10, 2023

@fab-10
Thank you, it is much clearer now. I had not understood that the invalid txs cache was a tx spam control.
