Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky console-subscriber integration tests #473

Open
hds opened this issue Oct 11, 2023 · 2 comments
Open

Flaky console-subscriber integration tests #473

hds opened this issue Oct 11, 2023 · 2 comments
Labels
C-subscriber Crate: console-subscriber. S-bug Severity: bug

Comments

@hds
Copy link
Collaborator

hds commented Oct 11, 2023

What crate(s) in this repo are involved in the problem?

console-subscriber

What is the issue?

The console-subscriber integration tests show flaky behavior locally.

The tests were originally introduced in #452.

How can the bug be reproduced?

Run all the framework tests locally. Sometimes the self_wakes test fails.

To reproduce this, I ran the tests in a loop until they failed. On my machine it typically takes under 5 minutes to produce a failure.

echo "Start!" && time while [ $? -eq 0 ]; do cargo test -p console-subscriber --test framework; done

Logs, error output, etc

No response

Versions

  • main branch.
  • console-subscriber v0.2.0

Possible solution

It appears that the root cause for this issue is the one described in tokio-rs/tracing#2743.

Additional context

No response

Would you like to work on fixing this bug?

yes

@hds hds added the S-bug Severity: bug label Oct 11, 2023
hds added a commit that referenced this issue Oct 11, 2023
The `ConsoleLayer` uses the `Callsites` struct to store and check for
callsites for specific kinds of traces, for example spawn spans or waker
events.

`Callsites` stores a fixed size array of pointers to the `Metadata` for
each callsite and a length to indicate how many callsites it has
registered. The length and each individual pointer are stored in
atomics.

Since it is possible for these values to change individually, if a
callsite lookup fails, we check if the length of the array has changed
while we were checking the pointers, if it has, the lookup is started
again.

However, there is still a possible race condition. If the length
changes, but the lookup occurs before the callsite pointer is actually
written, then we may miss a callsite that is in the process of being
registered. In this case, the pointer which is loaded from the
`Callsites` array will be null.

This change adds a check for this case (null ptr), and reperforms the
lookup if it occurs.

This race condition was found while chasing down the source of #473. It
doesn't solve the flakiness, but it can reduce the likelihood of it
occuring, thus it is a mitigation only.

In reality, neither of these race condition checks should be needed, as
we would expect that `tracing` guarantees that `ConsoleLayer` completes
`register_callsite()` before `on_event()` or `new_span()` are called.
hds added a commit that referenced this issue Oct 11, 2023
A flakiness problem has been discovered with the `console-subscriber`
integration tests introduced in #452. Issue #473 is tracking the issue.

It has been observed that we only "miss" the wake operation event when
it comes from `yield_now()`, but not when it comes from a task that
yielded due to `sleep`, even when the duration is zero. it is likely
that this is due to nature of the underlying race condition.

This change removes all the calls to `yield_now()` from the `framework`
tests, except those where we wish to actually test self wakes.
@hds hds added the C-subscriber Crate: console-subscriber. label Oct 11, 2023
hds added a commit that referenced this issue Oct 25, 2023
A flakiness problem has been discovered with the `console-subscriber`
integration tests introduced in #452. Issue #473 is tracking the issue.

It has been observed that we only "miss" the wake operation event when
it comes from `yield_now()`, but not when it comes from a task that
yielded due to `sleep`, even when the duration is zero. it is likely
that this is due to nature of the underlying race condition.

This change removes all the calls to `yield_now()` from the `framework`
tests, except those where we wish to actually test self wakes.
Additionally, all the sleeps have been moved out into a separate function
which describes why we're using `sleep` instead of `yield_now` when
either of them would be sufficient.
@Rustin170506
Copy link
Collaborator

@hds
Copy link
Collaborator Author

hds commented Apr 10, 2024

These two tests are so flaky that I'm going to delete them until we fix the problem. From console-subscriber/tests/wake.rs:

#[test]
fn sleep_wakes() {
    let expected_task = ExpectedTask::default()
        .match_default_name()
        .expect_wakes(1)
        .expect_self_wakes(0);

    let future = async {
        sleep(Duration::ZERO).await;
    };

    assert_task(expected_task, future);
}

#[test]
fn double_sleep_wakes() {
    let expected_task = ExpectedTask::default()
        .match_default_name()
        .expect_wakes(2)
        .expect_self_wakes(0);

    let future = async {
        sleep(Duration::ZERO).await;
        sleep(Duration::ZERO).await;
    };

    assert_task(expected_task, future);
}

hds added a commit that referenced this issue Apr 10, 2024
We have identified that some of the `console-subscriber` integration
tests are flaky (#473). This appears to be a result of the following
issue in `tracing` tokio-rs/tracing#2743.

Flaky tests are really worse than no tests. So these tests will be
removed until the flakiness can be fixed.
hds added a commit that referenced this issue Apr 10, 2024
We have identified that some of the `console-subscriber` integration
tests are flaky (#473). This appears to be a result of the following
issue in `tracing` tokio-rs/tracing#2743.

Flaky tests are really worse than no tests. So these tests will be
removed until the flakiness can be fixed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-subscriber Crate: console-subscriber. S-bug Severity: bug
Projects
None yet
Development

No branches or pull requests

2 participants