UTXO hang on first finalized testnet block #1406

Closed
teor2345 opened this issue Nov 30, 2020 · 6 comments
Labels
A-rust (Area: Updates to Rust code), C-bug (Category: This is a bug)

Comments

@teor2345
Contributor

teor2345 commented Nov 30, 2020

Version

zebrad main branch 2020-11-30.

Platform

Linux ... 5.4.75 #1-NixOS SMP Thu Nov 5 10:43:38 UTC 2020 x86_64 GNU/Linux

Description

When I run zebrad on Testnet, it often fails to sync any blocks past Sapling. When it does sync blocks, it only syncs 99 blocks, then times out on the UTXO for the 100th block.

This error happens at block 356314 (finalized tip) and block 356413 (non-finalized tip). The block that times out on UTXOs is probably 356414.

Analysis

This error is most likely an issue with UTXO handling in the finalized state, where the finalized tip 356314's UTXOs aren't being saved, or aren't being sent to the verifier for block 356414.

Since the block download timeout is 20 seconds, and there can be up to 4 retries, it should be impossible for the UTXO request to time out (600s) before a block download fails with 4 timeouts (80s). There would have to be a UTXO chain of 8 slow-but-successful blocks (8 * 80s = 640s) to exceed the UTXO timeout.
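
To make the timeout arithmetic above easy to re-check, here is a minimal Rust sketch. The constant and function names are hypothetical and are not taken from the zebrad source; only the values (20s per download attempt, 4 attempts, 600s UTXO timeout) come from this report.

```rust
use std::time::Duration;

// Hypothetical names for illustration only; the values are the ones quoted in
// this report, not constants read from the zebrad source.
const BLOCK_DOWNLOAD_TIMEOUT: Duration = Duration::from_secs(20);
const BLOCK_DOWNLOAD_ATTEMPTS: u32 = 4;
const UTXO_LOOKUP_TIMEOUT: Duration = Duration::from_secs(600);

/// Longest a single block download can take before it fails outright:
/// every attempt runs to its full timeout.
fn max_block_download_time() -> Duration {
    BLOCK_DOWNLOAD_TIMEOUT * BLOCK_DOWNLOAD_ATTEMPTS
}

/// Smallest number of chained worst-case downloads whose total time
/// strictly exceeds the UTXO lookup timeout.
fn slow_blocks_to_exceed_utxo_timeout() -> u64 {
    UTXO_LOOKUP_TIMEOUT.as_secs() / max_block_download_time().as_secs() + 1
}
```

With these values, `max_block_download_time()` is 80 seconds and `slow_blocks_to_exceed_utxo_timeout()` is 8, matching the 8 * 80s = 640s figure: seven chained slow blocks (560s) would still finish inside the 600s UTXO timeout.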

I have seen similar issues on Mainnet, but they are harder to reproduce. This Testnet issue is reproducible after a restart.

teor2345 added the C-bug (Category: This is a bug), A-rust (Area: Updates to Rust code), and S-needs-triage (Status: A bug report needs triage) labels Nov 30, 2020
teor2345 added this to the First Alpha Release milestone Nov 30, 2020
@teor2345
Contributor Author

teor2345 commented Dec 2, 2020

This issue might be closed by #1425. We should re-test after it merges.

@hdevalence
Contributor

@teor2345 if you can reproduce this on testnet after merging #1425 and #1423, would it be possible to run a sync with filter = 'trace' and save a complete execution trace? It will likely be quite large, but it would let us trace exactly what data is causing the timeout.

@teor2345
Contributor Author

teor2345 commented Dec 2, 2020

I'll also want #1439. I'm just making sure it passes tests now.

@teor2345
Contributor Author

teor2345 commented Dec 3, 2020

Just updating this issue: I still haven't had a chance to do diagnostics, because I need to implement #1413 first.

mpguerra removed the S-needs-triage (Status: A bug report needs triage) label Dec 3, 2020
@teor2345
Contributor Author

teor2345 commented Dec 4, 2020

I just saw this hang again, when transitioning from the checkpoint verifier to the block verifier on testnet.

@teor2345
Contributor Author

teor2345 commented Dec 4, 2020

Looking at the debug output, I'm not convinced this issue is due to UTXOs. It seems like a duplicate of #1435.

teor2345 closed this as completed Dec 4, 2020

3 participants