Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DNM] Do not merge: Subxt UnstableBackend: Timeout issue repro #1318

Closed
wants to merge 31 commits into from

Conversation

lexnv
Copy link
Collaborator

@lexnv lexnv commented Dec 12, 2023

No description provided.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv requested review from a team as code owners December 12, 2023 14:37
This reverts commit 63aa514.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Dec 13, 2023

The issue has been reproduced at: https://github.com/paritytech/subxt/actions/runs/7184693579/job/19591409961

Potential repro factors:

  • memory logging
  • stress CI with all steps

@lexnv
Copy link
Collaborator Author

lexnv commented Dec 13, 2023

The lightclient test might be failing because of: smol-dot/smoldot#1442

From CI https://github.com/paritytech/subxt/actions/runs/7193534907/job/19592205260?pr=1318

running 1 test
test light_client::light_client_testing has been running for over 60 seconds
Error: The operation was canceled.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Dec 14, 2023

Logs reproducing the timeout issue:

MemLog: [
    (
        39.333µs,
        Validated,
    ),
    (
        2.845179181s,
        BestChainBlockIncluded {
            block: Some(
                TransactionBlockDetails {
                    hash: 0x59d21905fc5e145ac3c43d9629e978b65e5cfb9abd42d57d76cef1b87f77d152,
                    index: 1,
                },
            ),
        },
    ),
    (
        10.597509211s,
        Finalized {
            block: TransactionBlockDetails {
                hash: 0x59d21905fc5e145ac3c43d9629e978b65e5cfb9abd42d57d76cef1b87f77d152,
                index: 1,
            },
        },
    ),
]
SeenBlocksLog: {}

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Dec 14, 2023

Extended logs https://github.com/paritytech/subxt/actions/runs/7207617809/job/19641867036:

MemLog: [
    (
        31.979µs,
        Validated,
    ),
    (
        2.165150812s,
        BestChainBlockIncluded {
            block: Some(
                TransactionBlockDetails {
                    hash: 0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6,
                    index: 1,
                },
            ),
        },
    ),
    (
        9.332634314s,
        Finalized {
            block: TransactionBlockDetails {
                hash: 0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6,
                index: 1,
            },
        },
    ),
]
SeenBlocksLog: {
    0xe3f48450de3ada88eac7fa9ba4424cd93402ab87c15d4caaf456ad3fd7e44573: (
        Finalized,
        BlockRef {
            inner: BlockRefInner {
                hash: 0xe3f48450de3ada88eac7fa9ba4424cd93402ab87c15d4caaf456ad3fd7e44573,
                unpin_flags: Mutex {
                    data: {
                        0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6,
                    },
                    poisoned: false,
                    ..
                },
            },
        },
    ),
    0x96a65c7ee44ca9603871abd79e79692976c16a28b9337a32feb1ba12fddcf35b: (
        Finalized,
        BlockRef {
            inner: BlockRefInner {
                hash: 0x96a65c7ee44ca9603871abd79e79692976c16a28b9337a32feb1ba12fddcf35b,
                unpin_flags: Mutex {
                    data: {
                        0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6,
                    },
                    poisoned: false,
                    ..
                },
            },
        },
    ),
    0x6bee1e7f587337285ec158c4b559043d43cecb2786433b368eb4cf87ea9a3faa: (
        New,
        BlockRef {
            inner: BlockRefInner {
                hash: 0x6bee1e7f587337285ec158c4b559043d43cecb2786433b368eb4cf87ea9a3faa,
                unpin_flags: Mutex {
                    data: {
                        0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6,
                    },
                    poisoned: false,
                    ..
                },
            },
        },
    ),
    0x9a3850fcf259fa6a7fe6d2e4dc46c49155fd5e1ae5f4adecf6ac9df7f6d5c5a1: (
        New,
        BlockRef {
            inner: BlockRefInner {
                hash: 0x9a3850fcf259fa6a7fe6d2e4dc46c49155fd5e1ae5f4adecf6ac9df7f6d5c5a1,
                unpin_flags: Mutex {
                    data: {
                        0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6,
                    },
                    poisoned: false,
                    ..
                },
            },
        },
    ),
}

From the logs, the block hash 0xe7f24f26a34447d8755302bdc90f0791be4da9a513d137899ac2bff6b10768b6 is present in the unpin flag data. Which means the block has been dropped from the chainHead subscription

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added 2 commits January 8, 2024 18:09
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Jan 8, 2024

Logs from https://github.com/paritytech/subxt/actions/runs/7450115302/job/20268303243?pr=1318:

MemLog: [
    (
        47.158µs,
        Validated,
    ),
    (
        1.23551272s,
        BestChainBlockIncluded {
            block: Some(
                TransactionBlockDetails {
                    hash: 0x5b035c49782c71635672af2dc8fcdbe0da80a56cc8a3b313359a22a3c16e6247,
                    index: 1,
                },
            ),
        },
    ),
    (
        8.134685163s,
        Finalized {
            block: TransactionBlockDetails {
                hash: 0x5b035c49782c71635672af2dc8fcdbe0da80a56cc8a3b313359a22a3c16e6247,
                index: 1,
            },
        },
    ),
]




SeenBlocksLog: {
    0xbcd3b1b6b9235f303e726ee85fd2ea6f887392025b2f054123b586e872b8c49d: (
        New,
        BlockRef {
            inner: BlockRefInner {
                hash: 0xbcd3b1b6b9235f303e726ee85fd2ea6f887392025b2f054123b586e872b8c49d,
                unpin_flags: Mutex {
                    data: {},
                    poisoned: false,
                    ..
                },
            },
        },
    ),
    0xd3da03a16da3dda362ec954a43681adc99864387fc0824dda2fe9fcdf70aaaa0: (
        Finalized,
        BlockRef {
            inner: BlockRefInner {
                hash: 0xd3da03a16da3dda362ec954a43681adc99864387fc0824dda2fe9fcdf70aaaa0,
                unpin_flags: Mutex {
                    data: {},
                    poisoned: false,
                    ..
                },
            },
        },
    ),
    0x5b035c49782c71635672af2dc8fcdbe0da80a56cc8a3b313359a22a3c16e6247: (
        New,
        BlockRef {
            inner: BlockRefInner {
                hash: 0x5b035c49782c71635672af2dc8fcdbe0da80a56cc8a3b313359a22a3c16e6247,
                unpin_flags: Mutex {
                    data: {},
                    poisoned: false,
                    ..
                },
            },
        },
    ),

The transaction class reports 0x5b035c49782c71635672af2dc8fcdbe0da80a56cc8a3b313359a22a3c16e6247; however we only encounter the block as New. Might be an artifact of timing out the test too early

lexnv added 9 commits January 8, 2024 18:28
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added 4 commits January 9, 2024 17:47
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Jan 9, 2024

Other logs form failures above:

tx:
2.8s 0x223f8682299015d50dd00165ce89409da6025eec8257d3bb2b3120f12fe592f2
10s: 0x223f8682299015d50dd00165ce89409da6025eec8257d3bb2b3120f12fe592f2

chainhead:
init: 0xfd337c5cecaa60e83ab9ee8eec1cdbba64fd1495648f6e2889807886ded918a3

18us
new: 0x924bb93101b733a14dc142036cf053f4e95d616616909a7952836a34ba850f43
    parent: 0xfd337c5cecaa60e83ab9ee8eec1cdbba64fd1495648f6e2889807886ded918a3

22.4us
new: 0xcd73e9a9969d84ad32d9dbf3532bfe9d18e72481abdf4e61ad9584c685f42bbc
    parent: 0x924bb93101b733a14dc142036cf053f4e95d616616909a7952836a34ba850f43

2.8
new: 0x223f8682299015d50dd00165ce89409da6025eec8257d3bb2b3120f12fe592f2
    parent: 0xcd73e9a9969d84ad32d9dbf3532bfe9d18e72481abdf4e61ad9584c685f42bbc

4.7s
fin: 0x924bb93101b733a14dc142036cf053f4e95d616616909a7952836a34ba850f43

5.8s
new: 0xd9fd435f5a764bc1b3e14993694b24780a1f16215c6b56822b0df9d8d4b1515d
    parent: 0x223f8682299015d50dd00165ce89409da6025eec8257d3bb2b3120f12fe592f2

7.4s
fin: 0xcd73e9a9969d84ad32d9dbf3532bfe9d18e72481abdf4e61ad9584c685f42bbc

8.8s
new: 0xe815e7d6a8a3eb9956969e731b3f8896f4c3b5593a083a52ddbdb291ded6cdd9
    parent: 0xd9fd435f5a764bc1b3e14993694b24780a1f16215c6b56822b0df9d8d4b1515d

Issues

  • The chainHead subscription does not produce any other blocks or events after around 9 seconds.
  • The finalized block reproted by tx is never reported by the chainHead as Finalized
  • FollowEvent::Finalized is rerported before FollowEvent::NewBlock (and this patch captures the error pretty often)

Added a couple of tests for expected order of blocks:

  • with raw chainHead_unstable_follow RPC method
  • with UnstableBackend follow_handler.subscription().events()

However, I could not reproduce the issue locally (the 2 tests mentioned above are passing locally)

lexnv added 3 commits January 10, 2024 11:08
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@jsdw
Copy link
Collaborator

jsdw commented Jan 10, 2024

Thankyou for continuing to dig into this @lexnv! Do you have any ideas on what to try next to get to the bottom of the issue?

Its weird that we can't reproduce it locally; I wonder what might be different in CI nodes which leads to it.

If the issue seems to be on the Smoldot side, then let's raise an issue there to discuss the problem and provide steps to reproduce (even if only in CI and not reliably; whatever the best we can do is). @tomaka might be able to help us get to the bottom of it faster :)

@lexnv
Copy link
Collaborator Author

lexnv commented Jan 10, 2024

This timeout is related to our Unstable Backend.
After looking at the code and coupling it with the fact that the Finalized event is delivered before NewBlock event, I believe there's an issue wrt multiplexing chainHead_follow subscription in subxt that comes from:

shared.block_events_from_last_finalized.clear();

The issue that handles the light-client is: #1346; I'll use a git-dependency for smoldot to see if the issue is solved on the main branch.

@lexnv lexnv changed the title [DNM] Do not merge: Timeout issue repro [DNM] Do not merge: Subxt UnstableBackend: Timeout issue repro Jan 10, 2024
@lexnv
Copy link
Collaborator Author

lexnv commented Jan 10, 2024

Considering the small amount of memory logs collected, I expect the issue is related to a second subscription to chainHead. This happens when we have a test that submits two transactions.
The chainHead subscription is used to correlate the blocks reported by the transaction with pinned blocks.

Tests Analysis

The flaky issue of timeout happens for different tests:

However, with this patch (panic on out of order events) the issue reproduce constantly for
https://github.com/paritytech/subxt/actions/runs/7463000473/job/20306660770 - full_client::blocks::decode_signed_extensions_from_blocks.

With that patch, I was able to reproduce the issue locally for decode_signed_extensions_from_blocks.

Issue Timeline

Timeline detected by panicking when receiving a Finalized event before a NewBlock event:

0x1 ------ 0x2 ------ 0x3 ------ 0x4 ------ 0x5 ------ 0x6 ------ 0x7      ------ 0x8
T0 init       new        new        new
T1            fin
T2                                             new
T3                       fin
T4                                                        new
T5                                  fin
T6                                                                new
T5                                              fin


Second Subscription
                                           T0  init
                                           T1                                     new
                                           T2             fin

Root Cause

For the Second subscription:

  • block 0x5 is reported by the Initial event as finalized
  • blocks 0x6 and 0x7 are not reported as NewBlocks, although they are descendants of 0x5
    • blocks 0x6 and 0x7 were reported by the first subscription
  • block 0x8 is reported as new block (descendant of 0x7)
  • block 0x6 is reported as finalized, but never as NewBlock

The blocks 0x6 and 0x7 are not reported because we are only keeping a window of events from the last finalized to the present moment. However this approach will miss the descendant blocks already encountered of the finalized block:

// Keep a buffer of all events from last finalized block so that new
// subscriptions can be handed this info first.
block_events_from_last_finalized: VecDeque<FollowEvent<BlockRef<Hash>>>,

@jsdw
Copy link
Collaborator

jsdw commented Jan 10, 2024

Aah, I think I see now, thank you!

so when you start a subscription, you're told about all new current new blocks first and then get the regular events in.

when you start a second follow subscription in subxt, it reuses the existing one and does something like reports the current finalised block etc that have been cached, before reporting back the underlying subscription events. so it sounds like that logic has an issue!

Edit: maybe the issue is here:

shared.block_events_from_last_finalized.clear();

We clear all events when we see a new finalised block but whoops, that may include some new block events for blocks that have yet to be finalized! We should probably do something like clear events only for blocks that have been mentioned in that finalised event or something (taking care to avoid letting an endless stream of old events that we aren't clearing hit that are no longer relevant accumulate, so think about how all events need to be handled which may end up leading to the cached event structure changing a bit as needed) :)

maybe you knew this all already @lexnv and I'm just repeating things; sorry if so :) im on my phone so only skimming things hehe

Edit 2:

The blocks 0x6 and 0x7 are not reported because we are only keeping a window of events from the last finalized to the present moment. However this approach will miss the descendant blocks already encountered of the finalized block

ah yup, you'd already seen this all :)

lexnv added 2 commits January 10, 2024 19:04
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv
Copy link
Collaborator Author

lexnv commented Jan 15, 2024

Continued at: #1358

@lexnv lexnv closed this Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants