Skip to content

Conversation

@kashitaka
Copy link
Contributor

@kashitaka kashitaka commented Jul 25, 2025

Fixes: #32252

This PR addresses a flakiness in the rollback test.

I found nonce collision caused transactions occasionally fail to send. I tried to change error message in the failed test like:

	if err = client.SendTransaction(ctx, signedTx); err != nil {
		t.Fatalf("failed to send transaction: %v, nonce: %d", err, signedTx.Nonce())
	}

and I occasionally got test failure with this message:

=== CONT  TestFlakyFunction/Run_#100
    rollback_test.go:44: failed to send transaction: already known, nonce: 0
--- FAIL: TestFlakyFunction/Run_#100 (0.07s)

Although nonces are obtained via PendingNonceAt, we observed that, in rare cases (approximately 1 in 1000), two transactions from the same sender end up with the same nonce. This likely happens because tx0 has not yet propagated to the transaction pool before tx1 requests its nonce. When the test succeeds, tx0 and tx1 have nonces 0 and 1, respectively. However, in rare failures, both transactions end up with nonce 0.

Fix

We modified the test to explicitly assign nonces to each transaction. By controlling the nonce values manually, we eliminated the race condition and ensured consistent behavior. After several thousand runs, the flakiness was no longer reproducible in my local environment.

Additional

Reduced internal polling interval in pendingStateHasTx() to speed up test execution without impacting stability. It reduces test time for TestTransactionRollbackBehavior from about 7 seconds to 2 seconds.

@kashitaka kashitaka requested a review from fjl as a code owner July 25, 2025 18:18
@rjl493456442
Copy link
Member

Can you modify the testSendSignedTx, providing the nonce manually instead of fetching from the pendingNonce?

So that you can get rid of the time.Sleep between the tx submission.

@kashitaka
Copy link
Contributor Author

@rjl493456442 Thank you for your feedback. I applied it in 74caa26

Can you modify the testSendSignedTx, providing the nonce manually instead of fetching from the pendingNonce?
So that you can get rid of the time.Sleep between the tx submission.

Agreed, that sounds like a more stable approach. I initially assumed the test specifically wanted to fetch the nonce via PendingNonceAt, so I didn’t modify that part. But if it’s okay to change, I agree it’s better this way.

Since we can’t modify the nonce on an already constructed tx, I didn’t change testSendSignedTx itself — instead, I updated newTx and newBlobTx to accept and use a manually provided nonce. Could you take a look and let me know if this approach looks legit?

)

// Poll for receipt with timeout
deadline := time.Now().Add(2 * time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason for changing the wait time duration?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our CI system is occasionally overloaded, have a loose duration is usually preferred

Copy link
Contributor Author

@kashitaka kashitaka Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjl493456442 Thanks for the comment! As mentioned in the PR description, It reduces test time for TestTransactionRollbackBehavior from about 7 seconds to 2 seconds.

I shortened the wait time because I ran the test thousands times to reproduce the flakiness, and here was the bottle neck. The test remained stable with the shorter duration, so I kept the change for some improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. This loop is for checking the availability of the receipt. Once it's found, the loop will be terminated.

If the CI can be finished within 2s, then setting it with 7s has no side effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure if this behavior is intentional or unexpected. In this line:

 	if !pendingStateHasTx(client, btx2) || !pendingStateHasTx(client, tx2) || !pendingStateHasTx(client, tx3) {

the polling loop exits almost immediately, since the receipt is available and TransactionReceipt() succeeds — just as you explained. However, in this part:

	if pendingStateHasTx(client, btx0) || pendingStateHasTx(client, tx0) || pendingStateHasTx(client, tx1) {

each call to pendingStateHasTx takes the full 2 seconds and takes 6 seconds in total, because:

receipt, err = client.TransactionReceipt(ctx, tx.Hash())

returns an error like "transaction indexing is in progress" and doesn't drop into break in the 2 sec polling, which seems to happen when querying uncommitted transactions. And that’s why I shortened the polling timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjl493456442 If you prefer a more relaxed polling interval, please let me know—I’m happy to revert that change.

@kashitaka kashitaka force-pushed the kashi/issue-32252 branch from 9a9b71f to 74caa26 Compare July 28, 2025 08:07
@rjl493456442
Copy link
Member

Thanks for your solid work!

@rjl493456442 rjl493456442 added this to the 1.16.2 milestone Jul 28, 2025
@rjl493456442 rjl493456442 merged commit eb7aef4 into ethereum:master Jul 28, 2025
4 of 8 checks passed
calbera added a commit to berachain/bera-geth that referenced this pull request Jul 30, 2025
* core/types: minimize this invalid intermediate state (ethereum#32241)

* core/rawdb: downgrade log level in chain freezer (ethereum#32253)

* triedb/pathdb:  use binary.append to eliminate the tmp scratch slice (ethereum#32250)

`binary.AppendUvarint` offers better performance than using append
directly, because it avoids unnecessary memory allocation and copying.

In our case, it can increase the performance by +35.8% for the
`blockWriter.append` function:

```
benchmark                        old ns/op     new ns/op     delta
BenchmarkBlockWriterAppend-8     5.97          3.83          -35.80%
```

---------

Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>

* p2p/rlpx: optimize XOR operation using bitutil.XORBytes (ethereum#32217)

Replace manual byte-by-byte XOR implementation with the optimized
bitutil.XORBytes function. This improves performance by using word-sized
operations on supported architectures while maintaining the same
functionality. The optimized version processes data in bulk rather than
one byte at a time

---------

Co-authored-by: Felix Lange <fjl@twurst.com>

* eth/gasestimator: fix potential overflow (ethereum#32255)

Improve binary search, preventing the potential overflow in certain L2 cases

* triedb/pathdb: fix an deadlock in history indexer (ethereum#32260)

Seems the `signal.result` was not sent back in shorten case, this will
cause a deadlock.

---------

Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>

* eth/protocols/snap: add healing and syncing metrics (ethereum#32258)

Adds the heal time and snap sync time to grafana

---------

Co-authored-by: Gary Rong <garyrong0905@gmail.com>

* core: replace the empty fmt.Errorf with errors.New (ethereum#32274)

The `errors.new` function does not require string formatting, so its
performance is better than that of `fmt.Errorf`.

* eth/catalyst: fix error message in ExecuteStatelessPayloadV4 (ethereum#32269)

Correct the error message in the ExecuteStatelessPayloadV4 function to
reference newPayloadV4 and the Prague fork, instead of incorrectly
referencing newPayloadV3 and Cancun. 

This improves clarity during debugging and aligns the error message with 
the actual function and fork being validated. No logic is changed.

---------

Co-authored-by: rjl493456442 <garyrong0905@gmail.com>

* cmd, eth, internal: introduce debug_sync (ethereum#32177)

Alternative implementation of ethereum#32159

* all: replace fmt.Errorf with errors.New (ethereum#32286)

The errors.new function does not require string formatting, so its
performance is better than that of fmt.Errorf.

* downloader: fix typos, grammar and formatting (ethereum#32288)

* ethclient/simulated: Fix flaky rollback test (ethereum#32280)

This PR addresses a flakiness in the rollback test discussed in
ethereum#32252

I found `nonce` collision caused transactions occasionally fail to send.
I tried to change error message in the failed test like:

```
	if err = client.SendTransaction(ctx, signedTx); err != nil {
		t.Fatalf("failed to send transaction: %v, nonce: %d", err, signedTx.Nonce())
	}
```

and I occasionally got test failure with this message:

```
=== CONT  TestFlakyFunction/Run_#100
    rollback_test.go:44: failed to send transaction: already known, nonce: 0
--- FAIL: TestFlakyFunction/Run_#100 (0.07s)
```

Although `nonces` are obtained via `PendingNonceAt`, we observed that,
in rare cases (approximately 1 in 1000), two transactions from the same
sender end up with the same nonce. This likely happens because `tx0` has
not yet propagated to the transaction pool before `tx1` requests its
nonce. When the test succeeds, `tx0` and `tx1` have nonces `0` and `1`,
respectively. However, in rare failures, both transactions end up with
nonce `0`.

We modified the test to explicitly assign nonces to each transaction. By
controlling the nonce values manually, we eliminated the race condition
and ensured consistent behavior. After several thousand runs, the
flakiness was no longer reproducible in my local environment.

Reduced internal polling interval in `pendingStateHasTx()` to speed up
test execution without impacting stability. It reduces test time for
`TestTransactionRollbackBehavior` from about 7 seconds to 2 seconds.

* core/state: preallocate capacity for logs list (ethereum#32291)

Improvement: preallocate capacity for `logs` at first to avoid
reallocating multi times.

* core/state: improve PrettyPrint function (ethereum#32293)

---------

Signed-off-by: jsvisa <delweng@gmail.com>
Co-authored-by: maskpp <maskpp266@gmail.com>
Co-authored-by: Delweng <delweng@gmail.com>
Co-authored-by: Gary Rong <garyrong0905@gmail.com>
Co-authored-by: Micke <155267459+reallesee@users.noreply.github.com>
Co-authored-by: Felix Lange <fjl@twurst.com>
Co-authored-by: gzeon <h@arry.io>
Co-authored-by: Marius van der Wijden <m.vanderwijden@live.de>
Co-authored-by: nthumann <nthumanna@gmail.com>
Co-authored-by: Galoretka <galoretochka@gmail.com>
Co-authored-by: ericxtheodore <ericxtheodore@outlook.com>
Co-authored-by: Tomás Andróil <tomasandroil@gmail.com>
Co-authored-by: kashitaka <takao.w9st.kashima@xica.net>
howjmay pushed a commit to iotaledger/go-ethereum that referenced this pull request Aug 27, 2025
This PR addresses a flakiness in the rollback test discussed in
ethereum#32252

I found `nonce` collision caused transactions occasionally fail to send.
I tried to change error message in the failed test like:

```
	if err = client.SendTransaction(ctx, signedTx); err != nil {
		t.Fatalf("failed to send transaction: %v, nonce: %d", err, signedTx.Nonce())
	}
```

and I occasionally got test failure with this message:

```
=== CONT  TestFlakyFunction/Run_#100
    rollback_test.go:44: failed to send transaction: already known, nonce: 0
--- FAIL: TestFlakyFunction/Run_#100 (0.07s)
```

Although `nonces` are obtained via `PendingNonceAt`, we observed that,
in rare cases (approximately 1 in 1000), two transactions from the same
sender end up with the same nonce. This likely happens because `tx0` has
not yet propagated to the transaction pool before `tx1` requests its
nonce. When the test succeeds, `tx0` and `tx1` have nonces `0` and `1`,
respectively. However, in rare failures, both transactions end up with
nonce `0`.

We modified the test to explicitly assign nonces to each transaction. By
controlling the nonce values manually, we eliminated the race condition
and ensured consistent behavior. After several thousand runs, the
flakiness was no longer reproducible in my local environment.

Reduced internal polling interval in `pendingStateHasTx()` to speed up
test execution without impacting stability. It reduces test time for
`TestTransactionRollbackBehavior` from about 7 seconds to 2 seconds.
maoueh pushed a commit to streamingfast/go-ethereum that referenced this pull request Sep 4, 2025
This PR addresses a flakiness in the rollback test discussed in
ethereum#32252

I found `nonce` collision caused transactions occasionally fail to send.
I tried to change error message in the failed test like:

```
	if err = client.SendTransaction(ctx, signedTx); err != nil {
		t.Fatalf("failed to send transaction: %v, nonce: %d", err, signedTx.Nonce())
	}
```

and I occasionally got test failure with this message:

```
=== CONT  TestFlakyFunction/Run_#100
    rollback_test.go:44: failed to send transaction: already known, nonce: 0
--- FAIL: TestFlakyFunction/Run_#100 (0.07s)
```

Although `nonces` are obtained via `PendingNonceAt`, we observed that,
in rare cases (approximately 1 in 1000), two transactions from the same
sender end up with the same nonce. This likely happens because `tx0` has
not yet propagated to the transaction pool before `tx1` requests its
nonce. When the test succeeds, `tx0` and `tx1` have nonces `0` and `1`,
respectively. However, in rare failures, both transactions end up with
nonce `0`.

We modified the test to explicitly assign nonces to each transaction. By
controlling the nonce values manually, we eliminated the race condition
and ensured consistent behavior. After several thousand runs, the
flakiness was no longer reproducible in my local environment.

Reduced internal polling interval in `pendingStateHasTx()` to speed up
test execution without impacting stability. It reduces test time for
`TestTransactionRollbackBehavior` from about 7 seconds to 2 seconds.
gballet pushed a commit to gballet/go-ethereum that referenced this pull request Sep 11, 2025
This PR addresses a flakiness in the rollback test discussed in
ethereum#32252

I found `nonce` collision caused transactions occasionally fail to send.
I tried to change error message in the failed test like:

```
	if err = client.SendTransaction(ctx, signedTx); err != nil {
		t.Fatalf("failed to send transaction: %v, nonce: %d", err, signedTx.Nonce())
	}
```

and I occasionally got test failure with this message:

```
=== CONT  TestFlakyFunction/Run_#100
    rollback_test.go:44: failed to send transaction: already known, nonce: 0
--- FAIL: TestFlakyFunction/Run_#100 (0.07s)
```

Although `nonces` are obtained via `PendingNonceAt`, we observed that,
in rare cases (approximately 1 in 1000), two transactions from the same
sender end up with the same nonce. This likely happens because `tx0` has
not yet propagated to the transaction pool before `tx1` requests its
nonce. When the test succeeds, `tx0` and `tx1` have nonces `0` and `1`,
respectively. However, in rare failures, both transactions end up with
nonce `0`.

We modified the test to explicitly assign nonces to each transaction. By
controlling the nonce values manually, we eliminated the race condition
and ensured consistent behavior. After several thousand runs, the
flakiness was no longer reproducible in my local environment.

Reduced internal polling interval in `pendingStateHasTx()` to speed up
test execution without impacting stability. It reduces test time for
`TestTransactionRollbackBehavior` from about 7 seconds to 2 seconds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix flaky test TestEthClient

2 participants