
clear_data can delete the same transaction twice, resulting in debug_assert #2906

Closed
SkidanovAlex opened this issue Jun 26, 2020 · 1 comment · Fixed by #2924
SkidanovAlex commented Jun 26, 2020

I can only reproduce it under stress, but we can probably figure out why it happens just by inspecting the logic, without a lightweight repro.

The observed behavior:

A debug assert with the message "Transaction overwrites itself" (it fires inside StoreUpdate::commit); the trace shows the following line twice (along with multiple other similar lines, each occurring twice):

- ColTransactions 5rrivziGwNuAsKp5WqE1Z7Eh5oCPKZ6YtYL6sfZZz8Xw

This indicates that the same transaction was cleaned up twice.
The stacktrace shows that the commit is invoked from within clear_data.
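For illustration, here is a minimal sketch of the kind of duplicate-write check that trips here: a batched update that records (column, key) operations and debug-asserts if the same key is scheduled twice before commit. The Column and UpdateBatch types below are hypothetical stand-ins, not the actual nearcore StoreUpdate API.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Column {
    Transactions,
    // ... other columns elided
}

/// Hypothetical batched update, loosely mirroring the idea behind StoreUpdate.
struct UpdateBatch {
    ops: Vec<(Column, Vec<u8>)>, // keys scheduled for deletion
}

impl UpdateBatch {
    fn new() -> Self {
        UpdateBatch { ops: Vec::new() }
    }

    fn delete(&mut self, col: Column, key: Vec<u8>) {
        self.ops.push((col, key));
    }

    fn commit(self) {
        // The check corresponding to "Transaction overwrites itself":
        // the same (column, key) pair must not appear twice in one batch.
        let mut seen = HashSet::new();
        for (col, key) in &self.ops {
            debug_assert!(
                seen.insert((*col, key.clone())),
                "Transaction overwrites itself: {:?} {:?}",
                col,
                key
            );
        }
        // ... apply the ops to the underlying database here
    }
}

fn main() {
    let mut batch = UpdateBatch::new();
    let tx_hash = b"5rrivziGwNuAsKp5WqE1Z7Eh5oCPKZ6YtYL6sfZZz8Xw".to_vec();
    // clear_data scheduling the same transaction for deletion twice
    // (e.g. once per chunk that included it) trips the debug_assert.
    batch.delete(Column::Transactions, tx_hash.clone());
    batch.delete(Column::Transactions, tx_hash);
    batch.commit(); // panics in debug builds
}
```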

If you do want to reproduce it, check out the stress_fixed branch and run

python tests/stress/stress.py 3 3 3 0 staking transactions node_restart

It reproduces relatively consistently.

SkidanovAlex added the A-chain (Area: Chain, client & related) label on Jun 26, 2020
ilblackdragon added the C-bug (Category: This is a bug) label on Jun 29, 2020

Kouprin commented Jun 30, 2020

Since the same tx may be included in several chunks, it's possible for the tx to be deleted as many times as the number of forks we currently have.
The solution is to store a refcount on tx usage.
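As an illustration of that direction, the sketch below keeps a per-transaction reference count and only performs the actual deletion when the count drops to zero. The GcStore type and its methods are hypothetical, not the actual nearcore refcounting implementation.

```rust
use std::collections::HashMap;

type TxHash = [u8; 32];

/// Hypothetical refcounted view over transaction storage.
struct GcStore {
    tx_refcount: HashMap<TxHash, u32>,
}

impl GcStore {
    /// Called whenever a chunk that includes `tx` is stored.
    fn on_tx_included(&mut self, tx: TxHash) {
        *self.tx_refcount.entry(tx).or_insert(0) += 1;
    }

    /// Called from garbage collection for every chunk being cleared.
    /// Returns true only when the last reference is gone and the
    /// transaction should actually be deleted from ColTransactions.
    fn on_tx_cleared(&mut self, tx: TxHash) -> bool {
        match self.tx_refcount.get_mut(&tx) {
            Some(rc) if *rc > 1 => {
                *rc -= 1;
                false
            }
            Some(_) => {
                self.tx_refcount.remove(&tx);
                true // delete the underlying row exactly once
            }
            None => false, // already deleted; never schedule a second delete
        }
    }
}

fn main() {
    let mut store = GcStore { tx_refcount: HashMap::new() };
    let tx = [0u8; 32];
    // The same tx included in two chunks on different forks.
    store.on_tx_included(tx);
    store.on_tx_included(tx);
    assert!(!store.on_tx_cleared(tx)); // first fork cleared: keep the tx
    assert!(store.on_tx_cleared(tx)); // second fork cleared: delete it now
}
```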

SkidanovAlex added a commit that referenced this issue Jul 8, 2020
After this change stress.py passes consistently, and is reintroduced to nightly.

Nearcore fixes:

[v] We had a bug in the syncing logic (with a low chance of being
triggered in the wild): if a block is produced, and between 1/3 and 2/3
of block producers received it while the rest have not, the system
stalls, because no 2/3 of block producers have the same head, but also
nobody is two blocks behind the highest peer, which is what normally
starts syncing. Fixed by forcing sync if we've been 1 block behind for
too long (sketched below). stress.py was reproducing this issue in every run.

Test fixes

[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later, checking balances using the incorrect hash yielded an incorrect
success value, and thus applied incorrect corrections to the expected
balances;

[ ] Removing the old infrastructure for network interference, which relied on
certain node setup, and instead using the new network proxy infrastructure.

[ ] Adding a new argument that controls the percentage of dropped messages between nodes.

[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.

[ ] Altering the `node_restart` process to occasionally wipe out the data folder of
the node, so that we stress state sync (and syncing in general) more.

Other issues discovered while fixing stress.py:
- #2906
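To make the syncing fix above more concrete, here is a minimal sketch of the "force sync if we've been one block behind for too long" check. The SyncStatus type, should_force_sync, and the timeout value are hypothetical illustrations, not the actual nearcore sync code.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for how long we have been exactly one block
/// behind the best-known peer head.
struct SyncStatus {
    one_block_behind_since: Option<Instant>,
    force_sync_after: Duration,
}

impl SyncStatus {
    fn new(force_sync_after: Duration) -> Self {
        SyncStatus { one_block_behind_since: None, force_sync_after }
    }

    /// Returns true when sync should be forced even though we are only one
    /// block behind (normally sync starts only when two or more blocks behind).
    fn should_force_sync(&mut self, our_height: u64, best_peer_height: u64, now: Instant) -> bool {
        if best_peer_height == our_height + 1 {
            // Start (or keep) the timer while we stay exactly one block behind.
            let since = *self.one_block_behind_since.get_or_insert(now);
            now.duration_since(since) >= self.force_sync_after
        } else {
            // Either caught up, or far enough behind that regular sync kicks in.
            self.one_block_behind_since = None;
            false
        }
    }
}

fn main() {
    let mut status = SyncStatus::new(Duration::from_secs(5));
    let t0 = Instant::now();
    assert!(!status.should_force_sync(10, 11, t0));
    assert!(status.should_force_sync(10, 11, t0 + Duration::from_secs(6)));
}
```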
SkidanovAlex added a commit that referenced this issue Jul 8, 2020
SkidanovAlex added a commit that referenced this issue Jul 24, 2020
SkidanovAlex added a commit that referenced this issue Jul 30, 2020
nearprotocol-bulldozer bot pushed a commit that referenced this issue Jul 31, 2020
…#3036)

After this change stress.py node_restart passes relatively consistently, and is reintroduced to nightly.

Nearcore fixes:

- We had a bug in the syncing logic (with a low chance of being
triggered in the wild): if a block is produced, and between 1/3 and 2/3
of block producers received it while the rest have not, the system
stalls, because no 2/3 of block producers have the same head, but also
nobody is two blocks behind the highest peer, which is what normally
starts syncing. Fixed by forcing sync if we've been 1 block behind for
too long. stress.py was reproducing this issue in every run.

- (#2916) We had an issue where, if a node produced a chunk and then
crashed, on recovery it was not able to serve the chunk because it didn't have
all the parts and receipts stored in the storage from which we recover
cache entries in the shards manager. Fixed by always storing all the
parts and receipts (redundantly) for chunks in the shards we care about (sketched below).

Test fixes

[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later, checking balances using the incorrect hash yielded an incorrect
success value, and thus applied incorrect corrections to the expected
balances;

[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.

Other issues discovered while fixing stress.py:
- #2906
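As a rough illustration of the #2916 fix above, the sketch below persists every part and receipt of a chunk in a shard we track at the time the chunk is produced or received, so that a restarted node can rebuild its shards-manager cache from storage and still serve the chunk. All type and method names here are hypothetical, not the actual ShardsManager API.

```rust
use std::collections::HashMap;

type ChunkHash = [u8; 32];

/// Hypothetical persistent storage for chunk parts and receipts.
#[derive(Default)]
struct ChunkPartsStore {
    parts: HashMap<ChunkHash, Vec<Vec<u8>>>,
    receipts: HashMap<ChunkHash, Vec<Vec<u8>>>,
}

impl ChunkPartsStore {
    /// Store *all* parts and receipts for a chunk in a shard we care about,
    /// even redundantly, so a crash cannot leave us unable to serve the
    /// chunk after restart.
    fn persist_chunk(&mut self, chunk: ChunkHash, parts: Vec<Vec<u8>>, receipts: Vec<Vec<u8>>) {
        self.parts.insert(chunk, parts);
        self.receipts.insert(chunk, receipts);
    }

    /// On startup, rebuild the in-memory shards-manager cache from storage.
    fn recover_cache(&self) -> HashMap<ChunkHash, (Vec<Vec<u8>>, Vec<Vec<u8>>)> {
        self.parts
            .iter()
            .map(|(hash, parts)| {
                let receipts = self.receipts.get(hash).cloned().unwrap_or_default();
                (*hash, (parts.clone(), receipts))
            })
            .collect()
    }
}

fn main() {
    let mut store = ChunkPartsStore::default();
    let chunk = [1u8; 32];
    store.persist_chunk(chunk, vec![vec![0xAA], vec![0xBB]], vec![vec![0xCC]]);
    // Simulated restart: the cache is rebuilt from what was persisted,
    // so the node can still serve all parts of the chunk it produced.
    let cache = store.recover_cache();
    assert_eq!(cache[&chunk].0.len(), 2);
}
```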
bowenwang1996 pushed a commit that referenced this issue Aug 14, 2020