Tracing extension doesn't work properly when used with async code #448

ibralu · 2021-03-19T19:07:16Z

Expected Behavior

Should build correct spans tree

Actual Behavior

Inconsistent span trees, some spans are missing parent.
Don't think span.with_subscriber(|(id, d)| d.enter(id)) works in multi threaded runtime

Steps to Reproduce the Problem

Clone this repo https://github.com/ibralu/async-graphql-tracing. I've used starwars example with sleep on resolvers
Start jaeger

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.22

Test with query

{
  hero(episode: EMPIRE) {
    id
    name
    friends {
      id
      name
    }
    appearsIn
  }
}

Specifications

Version: 2.6
Platform: macOS
Subsystem: 11.2

The text was updated successfully, but these errors were encountered:

sunli829 · 2021-03-20T02:38:56Z

Thanks for the test, I have fixed it in v2.6.1 version. 🙂

ibralu · 2021-03-20T03:08:33Z

Thank you for the fix, however, I don't think that would work though, as @D1plo1d has mentioned in #447 we need to wrap the future with instrument
https://github.com/tokio-rs/tracing#in-asynchronous-code

sunli829 · 2021-03-20T03:23:39Z

Thank you for the fix, however, I don't think that would work though, as @D1plo1d has mentioned in #447 we need to wrap the future with instrument
https://github.com/tokio-rs/tracing#in-asynchronous-code

I refer to the code implementation of instrument, in fact it is to call enter repeatedly in different threads.

https://github.com/tokio-rs/tracing/blob/0e8f920d4bc47643d9a035ed2bbe9504ec076ca4/tracing/src/instrument.rs#L148

sunli829 · 2021-03-20T03:46:27Z

Sorry, there still seems to be a problem.

sunli829 · 2021-03-20T09:24:37Z

I still haven't found a solution. 😂

sunli829 · 2021-03-20T10:29:33Z

I am going to switch to opentelemetry, tracing makes me feel very confused.

sunli829 · 2021-03-20T11:43:58Z

I added OpenTelemetry support to the master branch, and it works very well now.
The way tracing works in multithreading is very strange, I decided to give up support for it. @ibralu @D1plo1d

sunli829 · 2021-03-20T11:46:55Z

D1plo1d · 2021-03-20T14:31:11Z

@sunli829 I'm sorry to hear that. I'm not sure if this helps re: tracing but I think the key to their design is entering the span inside poll and dropping it at the end of poll.

Entering span anywhere else will not work because it will wrap all futures that execute on that thread until it is dropped.

The way they've built it every async thread is only inside the span for the period of a future's poll function - as soon as the poll function returns the drop handler exits the span.

sunli829 · 2021-03-20T14:37:57Z

I think there is no problem with my last modification, but it just doesn't work. This is related to cross-threading.

I ran into the same problem as described here, but the more troublesome thing is that I need a lot of Spans.
tokio-rs/tracing#1188

D1plo1d · 2021-03-20T15:20:30Z

My apologies if I'm missing something but looking at commit cf1fe6f it looks as though enter is being called outside of await. As I understand it this is not equivalent to instrument. As I understand it async_graphql currently handles spans like this:

Thread 1:

ExecutionFuture.poll starts
Execution Start (enter span)
Do some synchronous execution in the correct span
Hit some asynchronous IO
ExecutionFuture.poll returns but span is not dropped
Poll an unrelated future with the span still applied to the current thread.
[Various futures run correctly and incorrectly in the span]
Final call to ExecutionFuture.poll drops the span

How Instrument works:

ExecutionFuture.poll starts
Execution Start (enter span)
Do some synchronous execution in the correct span
Hit some asynchronous IO
ExecutionFuture.poll returns and the span is dropped (and thereby removed from the current thread).
Poll an unrelated future without the span
[Various futures run correctly with the span only applied during Execution future.poll]

ibralu · 2021-03-21T04:01:39Z

Thank you for your effort, I think same problem would happen with opentelemtry. Pls check the opentelemtry async example https://github.com/open-telemetry/opentelemetry-rust/tree/main/examples/async they call any_future.with_context() (I think same logic as tracing any_future.instrument()) so both opentelemetry and tracing won't work properly unless we actualy pass futures to extensions (e.g. Output::resolve() to resolve_start) and return wrapped version

sunli829 · 2021-03-21T14:02:17Z

I resumed the tracing extension and sorted out the code, but I still don't understand what is wrong.

sunli829 · 2021-03-22T02:42:25Z

After my repeated tests, there is no problem with tracing and opentelemetry. 😁 @ibralu @D1plo1d

sunli829 · 2021-03-22T04:42:45Z

Released in v2.6.3

ibralu · 2021-03-22T14:12:27Z

Hi, thank you😁 gonna try it this week

D1plo1d · 2021-03-25T03:40:17Z

I've just updated! 😄

D1plo1d · 2021-03-25T04:07:06Z

Hmm, do we still have a way to instrument websockets with a custom root span (eg. I'm using async_graphql_warp::graphql_subscription_with_data) now that TracingConfig.parent_span() has been replaced?

sunli829 · 2021-03-25T04:42:55Z

For a single request, it is easy to customize the parent span. But for subscriptions, asynchronous streaming is at a deeper level, maybe I should restore TracingConfig.parent_span.

let span = tracing::info_span!("root");
Ok::<_, Infallible>(Response::from(
    schema.execute(request).instrument(span).await,
))

sunli829 · 2021-03-26T04:08:55Z

I made a lot of changes, and now the Tracing extension is finally working great!

sunli829 · 2021-03-26T04:09:06Z

Released in v2.7.0

sunli829 · 2021-04-02T01:17:05Z

I think this issue has been resolved, if you find that it is still incorrect, please feel free to open it.

menixator · 2021-04-02T12:17:34Z

Tracing still won't work with async graphql.
We will have to use .instrument to ATTACH spans to futures so that when they are polled, the correct span gets entered and when poll returns, the span would be exited.

The current implementation is not at all correct because it drops the Entered struct at the end of enter_span functions.

https://github.com/async-graphql/async-graphql/blob/master/src/extensions/tracing.rs#L98

However, it is not as easy as simpling holding onto the return values from Span::enter() function. Since we are using futures here, every time the Future's poll() method returns, we have to exit the span so that another task will not log to the incorrect span. The Instrument trait helps out with this.

Instrument trait defines an instrument method on all futures that returns a wrapper struct for a future, and a span. Every time the Instrumented struct is polled, the attached span will be entered. When the inner future's poll returns, it will exit the span.

Following is a simplified demonstration of how async-graphql is trying to handle tracing vs how it should handle it.

use tracing::{span, Level};
use tracing_subscriber::{fmt::time::ChronoLocal, EnvFilter, FmtSubscriber};
use tokio::time::Duration;
use tracing::Instrument;

#[tokio::main]
async fn main() {
    let filter = EnvFilter::from_default_env();
    let subscriber = FmtSubscriber::builder()
        // all spans/events with a level higher than TRACE (e.g, debug, info, warn, etc.)
        // will be written to stdout.
        .with_env_filter(filter)
        // .json()
        //.pretty()
        .with_timer(ChronoLocal::rfc3339())
        // completes the builder.
        .finish();

    tracing::subscriber::set_global_default(subscriber).expect("setting default subscriber failed");
    let span = span!(Level::INFO, "my_tracing_demo");
    let _s = span.enter();

    
    // This is currently what async-graphql is doing right now
    tokio::task::spawn(async move{
        let span = span!(Level::INFO, "ONE");
        let handle = span.enter();
        example_async_fn_one().await
    });

    tokio::task::spawn(async move {
        let span = span!(Level::INFO, "TWO");
        let handle = span.enter();
        example_async_fn_two().await
    });
    tokio::time::sleep(Duration::from_secs(30)).await;

    
    
    /*
    // This is what it should be doing
    tokio::task::spawn(async move{
        let span = span!(Level::INFO, "ONE");
        example_async_fn_one().instrument(span).await
    });

    tokio::task::spawn(async move {
        let span = span!(Level::INFO, "TWO");
        example_async_fn_two().instrument(span).await
    });
    tokio::time::sleep(Duration::from_secs(30)).await;
    */
}


async fn example_async_fn_one(){
    for i in 0..10 {
        tracing::info!("hello from example one!");
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

async fn example_async_fn_two(){
    for i in 0..10 {
        tracing::info!("hello from example two!");
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}

If you run the example above, without the instrument call, some of the traces will be output from the incorrect span.

So, to fix this, I think we will have to instrument the resolution futures within the library. But I don't think we can get away with that without making tracing a core part of async-graphql(Not that I am opposed to it). Ultimately, it's impossible to implement it as an extension.

sunli829 · 2021-04-02T12:52:04Z

So, to fix this, I think we will have to instrument the resolution futures within the library. But I don't think we can get away with that without making tracing a core part of async-graphql(Not that I am opposed to it). Ultimately, it's impossible to implement it as an extension.

In fact, my implementation is no different from instrument. If you can provide a simple demo to reproduce the problem you described, I would be happy to fix it. 🙂

sunli829 · 2021-04-02T13:12:54Z

Here is an example, it works very well.

https://github.com/sunli829/async-graphql-tracing

menixator · 2021-04-02T14:33:15Z

If I use tracing::{warn,trace,info...} within the resolvers, I would expect information that is attached to the resolution span such as id, path, parent_type and return_type to be present in every log message.

  fn resolve_start(&mut self, _ctx: &ExtensionContext<'_>, info: &ResolveInfo<'_>) {
        let parent = match info.resolve_id.parent {
            Some(parent_id) if parent_id > 0 => self.spans.get(&resolve_span_id(parent_id)),
            _ => self.spans.get(&EXECUTE_CTX),
        };

        if let Some(parent) = parent {
            let span = span!(
                target: "async_graphql::graphql",
                parent: parent,
                Level::INFO,
                "field",
                id = %info.resolve_id.current,
                path = %info.path_node,
                parent_type = %info.parent_type,
                return_type = %info.return_type,
            );
            self.enter_span(resolve_span_id(info.resolve_id.current), span);
        }
    }

However, if I modify the hero resolver in your example by adding a call to tracing::info!(), that data is not attached to the log message.

impl QueryRoot {
    async fn hero(
        &self,
        ctx: &Context<'_>,
        #[graphql(
            desc = "If omitted, returns the hero of the whole saga. If provided, returns the hero of that particular episode."
        )]
        episode: Episode,
    ) -> Character {
        tracing::info!("I expect contextual information to be attached to this log");
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
        if episode == Episode::Empire {
            Human(ctx.data_unchecked::<StarWars>().luke).into()
        } else {
            Droid(ctx.data_unchecked::<StarWars>().artoo).into()
        }
    }

When you're creating a span and entering it, the fields attached to the span will be present in any subsequent events so long as the return value from span.enter() is not dropped.

Eg:

Output:

Notice how request_path is present in both the log messages even though I didn't explicitly pass them to tracing::info. That's the part that's broken right now. Tracing would work as a basic logger as of now. But the full power of tracing cannot be utilized with async-graphql as it is right now.

Also, sorry about the delay. My laptop is a potato so compiling stuff takes a while lol.

sunli829 · 2021-04-02T15:53:00Z

I understand the problem you encountered. Because of the lack of the current span, tracing::{warn,trace,info...} can't work at all. I did neglect this problem, thank you for telling me!
I will carefully consider how to solve this problem. 🙂

D1plo1d · 2021-04-02T16:02:25Z

@sunli829 Super appreciate you having another look at this. I should have mentioned that my issues were also due to warn! calls inside my revolvers. My original proposal has gotten a bit buried in the threads but as a first approximation of a solution I believe an around_execution hook could fix the issue without having to deeply integrate tracing-rs into the core of async-graphql: #447

sunli829 · 2021-04-02T16:04:05Z

@sunli829 Super appreciate you having another look at this. I should have mentioned that my issues were also due to warn! calls inside my revolvers. My original proposal has gotten a bit buried in the threads but as a first approximation of a solution I believe an around_execution hook could fix the issue without having to deeply integrate tracing-rs into the core of async-graphql:

#447

Thank you, I have actually considered the solution you put forward. It will definitely solve the problem, but it's just that the changes are too big to make me hesitate. 😂

D1plo1d · 2021-04-02T16:06:14Z

Great, your already on it then! Just wanted to make sure that one hadn't been lost in all the back and forth. 😄

D1plo1d · 2021-04-02T16:17:22Z

@sunli829 re: #447 I initially had concerns that every extension would need to wrap the execution future in some future they create (also my the way #447 is written made no sense if more then one around_execution extension was used).

As a bit of an amendment to #447 I realize now that if around_execution received the same boxed future type that it returned then a default implementation could simply pass through the original future without wrapping it.

Thus there should not be any performance impact or extra wrapping futures for the other extensions that do not instrument() the execution future. So with that amendment that aspect at least now feels better to me.

Edit: Words

ibralu · 2021-04-02T16:18:10Z

Sorry for delay, yeah unfortunately it doesn't work as expected. I see no way of putting tracing in the extensions. Have to move tracing inside the internal implementation with the feature flag(maybe?). I think its common practice to have tracing embedded into the code itself. Async code ref in tracing crate

The way to make it work properly is with wrapping the futures with .instument from tracing::Instrument trait or #[instrument] attr macro. Not sure what the best implementation would be though.

So for example we could wrap execute()

    pub async fn execute(&self, request: impl Into<Request>) -> Response {
        let request = request.into();
        let extensions = self.create_extensions(Default::default());

        do_execute(..)
          .await
          .instrument(tracing::span!(
            target: "async_graphql::graphql",
            Level::INFO,
            "execute"
        ))
        .await
    }

or

   #[instument(name = "execute", level = "info", skip(self, request))]
    pub async fn execute(&self, request: impl Into<Request>) -> Response {
        .....
    }

That way we never have to care about passing or getting parent spans. So calling tracing::Span::current() would work as expected regardless of where we call it. Because they enter span on each poll().

Regarding the async_hook I don't think lifetime references would be fixable.

sunli829 · 2021-04-03T13:21:31Z

Good news, I have made a lot of progress! 🙂

menixator · 2021-04-04T06:18:07Z

@sunli829 I really appreciate the hard work you're putting into this project. I'm definitely going to help out when I get my new laptop. 👍

sunli829 · 2021-04-04T06:45:22Z

I have implemented a new extension mechanism on the master branch, which is better than the previous implementation, and I believe this problem has been completely solved. If you have time, please help me test it, thank you. 🙂

menixator · 2021-04-04T07:33:00Z

Tested it out with my project. Works exactly as I expect it to, so far. I will keep testing it and will let you know if there are any issues.

D1plo1d · 2021-04-04T16:51:24Z

async fn execute(&self, ctx: &ExtensionContext<'_>, next: NextExtension<'_>) -> Response;

Hells yes! ^_^

ibralu added the bug Something isn't working label Mar 19, 2021

ibralu changed the title ~~Tracing async resolvers doesn't work properly~~ Tracing extension doesn't work properly when used with async code Mar 19, 2021

sunli829 added a commit that referenced this issue Mar 20, 2021

Fix tracing extension doesn't work with async code. ##448

a7f8c11

sunli829 added a commit that referenced this issue Mar 20, 2021

Fix tracing extension doesn't work with async code. ##448

cf1fe6f

sunli829 mentioned this issue Mar 20, 2021

tracing-rs spans are getting mis-applied to unrelated futures. Expose an around_execution hook to fix the tracing extension #447

Closed

sunli829 added the help wanted Extra attention is needed label Mar 21, 2021

sunli829 removed the help wanted Extra attention is needed label Mar 22, 2021

sunli829 closed this as completed Apr 2, 2021

sunli829 reopened this Apr 3, 2021

async-graphql locked and limited conversation to collaborators Apr 3, 2021

async-graphql unlocked this conversation Apr 3, 2021

async-graphql locked and limited conversation to collaborators Apr 3, 2021

async-graphql unlocked this conversation Apr 3, 2021

sunli829 closed this as completed Apr 6, 2021

Tracing extension doesn't work properly when used with async code #448

Tracing extension doesn't work properly when used with async code #448

Comments

ibralu commented Mar 19, 2021

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Specifications

sunli829 commented Mar 20, 2021

ibralu commented Mar 20, 2021

sunli829 commented Mar 20, 2021 • edited Loading

sunli829 commented Mar 20, 2021

sunli829 commented Mar 20, 2021

sunli829 commented Mar 20, 2021

sunli829 commented Mar 20, 2021 • edited Loading

sunli829 commented Mar 20, 2021

D1plo1d commented Mar 20, 2021 • edited Loading

sunli829 commented Mar 20, 2021

D1plo1d commented Mar 20, 2021

ibralu commented Mar 21, 2021

sunli829 commented Mar 21, 2021

sunli829 commented Mar 22, 2021 • edited Loading

sunli829 commented Mar 22, 2021

ibralu commented Mar 22, 2021

D1plo1d commented Mar 25, 2021

D1plo1d commented Mar 25, 2021 • edited Loading

sunli829 commented Mar 25, 2021

sunli829 commented Mar 26, 2021

sunli829 commented Mar 26, 2021

sunli829 commented Apr 2, 2021

menixator commented Apr 2, 2021 • edited Loading

sunli829 commented Apr 2, 2021

sunli829 commented Apr 2, 2021 • edited Loading

menixator commented Apr 2, 2021 • edited Loading

sunli829 commented Apr 2, 2021

D1plo1d commented Apr 2, 2021 • edited Loading

sunli829 commented Apr 2, 2021

D1plo1d commented Apr 2, 2021

D1plo1d commented Apr 2, 2021 • edited Loading

ibralu commented Apr 2, 2021

sunli829 commented Apr 3, 2021

menixator commented Apr 4, 2021

sunli829 commented Apr 4, 2021 • edited Loading

menixator commented Apr 4, 2021 • edited Loading

D1plo1d commented Apr 4, 2021

sunli829 commented Mar 20, 2021 •

edited

Loading

sunli829 commented Mar 20, 2021 •

edited

Loading

D1plo1d commented Mar 20, 2021 •

edited

Loading

sunli829 commented Mar 22, 2021 •

edited

Loading

D1plo1d commented Mar 25, 2021 •

edited

Loading

menixator commented Apr 2, 2021 •

edited

Loading

sunli829 commented Apr 2, 2021 •

edited

Loading

menixator commented Apr 2, 2021 •

edited

Loading

D1plo1d commented Apr 2, 2021 •

edited

Loading

D1plo1d commented Apr 2, 2021 •

edited

Loading

sunli829 commented Apr 4, 2021 •

edited

Loading

menixator commented Apr 4, 2021 •

edited

Loading