Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing extension doesn't work properly when used with async code #448

Closed
ibralu opened this issue Mar 19, 2021 · 37 comments
Closed

Tracing extension doesn't work properly when used with async code #448

ibralu opened this issue Mar 19, 2021 · 37 comments
Labels
bug Something isn't working

Comments

@ibralu
Copy link

ibralu commented Mar 19, 2021

Expected Behavior

Should build correct spans tree
image

Actual Behavior

Inconsistent span trees, some spans are missing parent.
Don't think span.with_subscriber(|(id, d)| d.enter(id)) works in multi threaded runtime

image

Steps to Reproduce the Problem

  1. Clone this repo https://github.com/ibralu/async-graphql-tracing. I've used starwars example with sleep on resolvers
  2. Start jaeger
docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.22
  1. Test with query
{
  hero(episode: EMPIRE) {
    id
    name
    friends {
      id
      name
    }
    appearsIn
  }
}


Specifications

  • Version: 2.6
  • Platform: macOS
  • Subsystem: 11.2
@ibralu ibralu added the bug Something isn't working label Mar 19, 2021
@ibralu ibralu changed the title Tracing async resolvers doesn't work properly Tracing extension doesn't work properly when used with async code Mar 19, 2021
@sunli829
Copy link
Collaborator

Thanks for the test, I have fixed it in v2.6.1 version. 🙂

@ibralu
Copy link
Author

ibralu commented Mar 20, 2021

Thank you for the fix, however, I don't think that would work though, as @D1plo1d has mentioned in #447 we need to wrap the future with instrument
https://github.com/tokio-rs/tracing#in-asynchronous-code

@sunli829
Copy link
Collaborator

sunli829 commented Mar 20, 2021

Thank you for the fix, however, I don't think that would work though, as @D1plo1d has mentioned in #447 we need to wrap the future with instrument
https://github.com/tokio-rs/tracing#in-asynchronous-code

I refer to the code implementation of instrument, in fact it is to call enter repeatedly in different threads.

https://github.com/tokio-rs/tracing/blob/0e8f920d4bc47643d9a035ed2bbe9504ec076ca4/tracing/src/instrument.rs#L148

@sunli829
Copy link
Collaborator

Sorry, there still seems to be a problem.

@sunli829
Copy link
Collaborator

I still haven't found a solution. 😂

@sunli829
Copy link
Collaborator

I am going to switch to opentelemetry, tracing makes me feel very confused.

@sunli829
Copy link
Collaborator

sunli829 commented Mar 20, 2021

I added OpenTelemetry support to the master branch, and it works very well now.
The way tracing works in multithreading is very strange, I decided to give up support for it. @ibralu @D1plo1d

@sunli829
Copy link
Collaborator

image

@D1plo1d
Copy link
Contributor

D1plo1d commented Mar 20, 2021

@sunli829 I'm sorry to hear that. I'm not sure if this helps re: tracing but I think the key to their design is entering the span inside poll and dropping it at the end of poll.

Entering span anywhere else will not work because it will wrap all futures that execute on that thread until it is dropped.

The way they've built it every async thread is only inside the span for the period of a future's poll function - as soon as the poll function returns the drop handler exits the span.

@sunli829
Copy link
Collaborator

I think there is no problem with my last modification, but it just doesn't work. This is related to cross-threading.

I ran into the same problem as described here, but the more troublesome thing is that I need a lot of Spans.
tokio-rs/tracing#1188

@D1plo1d
Copy link
Contributor

D1plo1d commented Mar 20, 2021

My apologies if I'm missing something but looking at commit cf1fe6f it looks as though enter is being called outside of await. As I understand it this is not equivalent to instrument. As I understand it async_graphql currently handles spans like this:

Thread 1:

  1. ExecutionFuture.poll starts
  2. Execution Start (enter span)
  3. Do some synchronous execution in the correct span
  4. Hit some asynchronous IO
  5. ExecutionFuture.poll returns but span is not dropped
  6. Poll an unrelated future with the span still applied to the current thread.
  7. [Various futures run correctly and incorrectly in the span]
  8. Final call to ExecutionFuture.poll drops the span

How Instrument works:

  1. ExecutionFuture.poll starts
  2. Execution Start (enter span)
  3. Do some synchronous execution in the correct span
  4. Hit some asynchronous IO
  5. ExecutionFuture.poll returns and the span is dropped (and thereby removed from the current thread).
  6. Poll an unrelated future without the span
  7. [Various futures run correctly with the span only applied during Execution future.poll]

@ibralu
Copy link
Author

ibralu commented Mar 21, 2021

Thank you for your effort, I think same problem would happen with opentelemtry. Pls check the opentelemtry async example https://github.com/open-telemetry/opentelemetry-rust/tree/main/examples/async they call any_future.with_context() (I think same logic as tracing any_future.instrument()) so both opentelemetry and tracing won't work properly unless we actualy pass futures to extensions (e.g. Output::resolve() to resolve_start) and return wrapped version

@sunli829
Copy link
Collaborator

I resumed the tracing extension and sorted out the code, but I still don't understand what is wrong.

@sunli829 sunli829 added the help wanted Extra attention is needed label Mar 21, 2021
@sunli829
Copy link
Collaborator

sunli829 commented Mar 22, 2021

After my repeated tests, there is no problem with tracing and opentelemetry. 😁 @ibralu @D1plo1d

@sunli829 sunli829 removed the help wanted Extra attention is needed label Mar 22, 2021
@sunli829
Copy link
Collaborator

Released in v2.6.3

@ibralu
Copy link
Author

ibralu commented Mar 22, 2021

Hi, thank you😁 gonna try it this week

@D1plo1d
Copy link
Contributor

D1plo1d commented Mar 25, 2021

I've just updated! 😄

@D1plo1d
Copy link
Contributor

D1plo1d commented Mar 25, 2021

Hmm, do we still have a way to instrument websockets with a custom root span (eg. I'm using async_graphql_warp::graphql_subscription_with_data) now that TracingConfig.parent_span() has been replaced?

@sunli829
Copy link
Collaborator

For a single request, it is easy to customize the parent span. But for subscriptions, asynchronous streaming is at a deeper level, maybe I should restore TracingConfig.parent_span.

let span = tracing::info_span!("root");
Ok::<_, Infallible>(Response::from(
    schema.execute(request).instrument(span).await,
))

@sunli829
Copy link
Collaborator

I made a lot of changes, and now the Tracing extension is finally working great!

@sunli829
Copy link
Collaborator

Released in v2.7.0

@sunli829
Copy link
Collaborator

sunli829 commented Apr 2, 2021

I think this issue has been resolved, if you find that it is still incorrect, please feel free to open it.

@sunli829 sunli829 closed this as completed Apr 2, 2021
@menixator
Copy link

menixator commented Apr 2, 2021

Tracing still won't work with async graphql.
We will have to use .instrument to ATTACH spans to futures so that when they are polled, the correct span gets entered and when poll returns, the span would be exited.

The current implementation is not at all correct because it drops the Entered struct at the end of enter_span functions.

https://github.com/async-graphql/async-graphql/blob/master/src/extensions/tracing.rs#L98

However, it is not as easy as simpling holding onto the return values from Span::enter() function. Since we are using futures here, every time the Future's poll() method returns, we have to exit the span so that another task will not log to the incorrect span. The Instrument trait helps out with this.

Instrument trait defines an instrument method on all futures that returns a wrapper struct for a future, and a span. Every time the Instrumented struct is polled, the attached span will be entered. When the inner future's poll returns, it will exit the span.

Following is a simplified demonstration of how async-graphql is trying to handle tracing vs how it should handle it.

use tracing::{span, Level};
use tracing_subscriber::{fmt::time::ChronoLocal, EnvFilter, FmtSubscriber};
use tokio::time::Duration;
use tracing::Instrument;

#[tokio::main]
async fn main() {
    let filter = EnvFilter::from_default_env();
    let subscriber = FmtSubscriber::builder()
        // all spans/events with a level higher than TRACE (e.g, debug, info, warn, etc.)
        // will be written to stdout.
        .with_env_filter(filter)
        // .json()
        //.pretty()
        .with_timer(ChronoLocal::rfc3339())
        // completes the builder.
        .finish();

    tracing::subscriber::set_global_default(subscriber).expect("setting default subscriber failed");
    let span = span!(Level::INFO, "my_tracing_demo");
    let _s = span.enter();

    
    // This is currently what async-graphql is doing right now
    tokio::task::spawn(async move{
        let span = span!(Level::INFO, "ONE");
        let handle = span.enter();
        example_async_fn_one().await
    });

    tokio::task::spawn(async move {
        let span = span!(Level::INFO, "TWO");
        let handle = span.enter();
        example_async_fn_two().await
    });
    tokio::time::sleep(Duration::from_secs(30)).await;

    
    
    /*
    // This is what it should be doing
    tokio::task::spawn(async move{
        let span = span!(Level::INFO, "ONE");
        example_async_fn_one().instrument(span).await
    });

    tokio::task::spawn(async move {
        let span = span!(Level::INFO, "TWO");
        example_async_fn_two().instrument(span).await
    });
    tokio::time::sleep(Duration::from_secs(30)).await;
    */
}


async fn example_async_fn_one(){
    for i in 0..10 {
        tracing::info!("hello from example one!");
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}

async fn example_async_fn_two(){
    for i in 0..10 {
        tracing::info!("hello from example two!");
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
}

If you run the example above, without the instrument call, some of the traces will be output from the incorrect span.

So, to fix this, I think we will have to instrument the resolution futures within the library. But I don't think we can get away with that without making tracing a core part of async-graphql(Not that I am opposed to it). Ultimately, it's impossible to implement it as an extension.

@sunli829
Copy link
Collaborator

sunli829 commented Apr 2, 2021

So, to fix this, I think we will have to instrument the resolution futures within the library. But I don't think we can get away with that without making tracing a core part of async-graphql(Not that I am opposed to it). Ultimately, it's impossible to implement it as an extension.

In fact, my implementation is no different from instrument. If you can provide a simple demo to reproduce the problem you described, I would be happy to fix it. 🙂

@sunli829
Copy link
Collaborator

sunli829 commented Apr 2, 2021

Here is an example, it works very well.

https://github.com/sunli829/async-graphql-tracing

@menixator
Copy link

menixator commented Apr 2, 2021

If I use tracing::{warn,trace,info...} within the resolvers, I would expect information that is attached to the resolution span such as id, path, parent_type and return_type to be present in every log message.

  fn resolve_start(&mut self, _ctx: &ExtensionContext<'_>, info: &ResolveInfo<'_>) {
        let parent = match info.resolve_id.parent {
            Some(parent_id) if parent_id > 0 => self.spans.get(&resolve_span_id(parent_id)),
            _ => self.spans.get(&EXECUTE_CTX),
        };

        if let Some(parent) = parent {
            let span = span!(
                target: "async_graphql::graphql",
                parent: parent,
                Level::INFO,
                "field",
                id = %info.resolve_id.current,
                path = %info.path_node,
                parent_type = %info.parent_type,
                return_type = %info.return_type,
            );
            self.enter_span(resolve_span_id(info.resolve_id.current), span);
        }
    }

However, if I modify the hero resolver in your example by adding a call to tracing::info!(), that data is not attached to the log message.

impl QueryRoot {
    async fn hero(
        &self,
        ctx: &Context<'_>,
        #[graphql(
            desc = "If omitted, returns the hero of the whole saga. If provided, returns the hero of that particular episode."
        )]
        episode: Episode,
    ) -> Character {
        tracing::info!("I expect contextual information to be attached to this log");
        tokio::time::sleep(std::time::Duration::from_secs(1)).await;
        if episode == Episode::Empire {
            Human(ctx.data_unchecked::<StarWars>().luke).into()
        } else {
            Droid(ctx.data_unchecked::<StarWars>().artoo).into()
        }
    }

image

When you're creating a span and entering it, the fields attached to the span will be present in any subsequent events so long as the return value from span.enter() is not dropped.

Eg:
image

Output:
image

Notice how request_path is present in both the log messages even though I didn't explicitly pass them to tracing::info. That's the part that's broken right now. Tracing would work as a basic logger as of now. But the full power of tracing cannot be utilized with async-graphql as it is right now.

Also, sorry about the delay. My laptop is a potato so compiling stuff takes a while lol.

@sunli829
Copy link
Collaborator

sunli829 commented Apr 2, 2021

I understand the problem you encountered. Because of the lack of the current span, tracing::{warn,trace,info...} can't work at all. I did neglect this problem, thank you for telling me!
I will carefully consider how to solve this problem. 🙂

@D1plo1d
Copy link
Contributor

D1plo1d commented Apr 2, 2021

@sunli829 Super appreciate you having another look at this. I should have mentioned that my issues were also due to warn! calls inside my revolvers. My original proposal has gotten a bit buried in the threads but as a first approximation of a solution I believe an around_execution hook could fix the issue without having to deeply integrate tracing-rs into the core of async-graphql: #447

@sunli829
Copy link
Collaborator

sunli829 commented Apr 2, 2021

@sunli829 Super appreciate you having another look at this. I should have mentioned that my issues were also due to warn! calls inside my revolvers. My original proposal has gotten a bit buried in the threads but as a first approximation of a solution I believe an around_execution hook could fix the issue without having to deeply integrate tracing-rs into the core of async-graphql:

#447

Thank you, I have actually considered the solution you put forward. It will definitely solve the problem, but it's just that the changes are too big to make me hesitate. 😂

@D1plo1d
Copy link
Contributor

D1plo1d commented Apr 2, 2021

Great, your already on it then! Just wanted to make sure that one hadn't been lost in all the back and forth. 😄

@D1plo1d
Copy link
Contributor

D1plo1d commented Apr 2, 2021

@sunli829 re: #447 I initially had concerns that every extension would need to wrap the execution future in some future they create (also my the way #447 is written made no sense if more then one around_execution extension was used).

As a bit of an amendment to #447 I realize now that if around_execution received the same boxed future type that it returned then a default implementation could simply pass through the original future without wrapping it.

Thus there should not be any performance impact or extra wrapping futures for the other extensions that do not instrument() the execution future. So with that amendment that aspect at least now feels better to me.

Edit: Words

@ibralu
Copy link
Author

ibralu commented Apr 2, 2021

Sorry for delay, yeah unfortunately it doesn't work as expected. I see no way of putting tracing in the extensions. Have to move tracing inside the internal implementation with the feature flag(maybe?). I think its common practice to have tracing embedded into the code itself. Async code ref in tracing crate

The way to make it work properly is with wrapping the futures with .instument from tracing::Instrument trait or #[instrument] attr macro. Not sure what the best implementation would be though.

So for example we could wrap execute()

    pub async fn execute(&self, request: impl Into<Request>) -> Response {
        let request = request.into();
        let extensions = self.create_extensions(Default::default());

        do_execute(..)
          .await
          .instrument(tracing::span!(
            target: "async_graphql::graphql",
            Level::INFO,
            "execute"
        ))
        .await
    }

or

   #[instument(name = "execute", level = "info", skip(self, request))]
    pub async fn execute(&self, request: impl Into<Request>) -> Response {
        .....
    }

That way we never have to care about passing or getting parent spans. So calling tracing::Span::current() would work as expected regardless of where we call it. Because they enter span on each poll().

Regarding the async_hook I don't think lifetime references would be fixable.

@sunli829 sunli829 reopened this Apr 3, 2021
@async-graphql async-graphql locked and limited conversation to collaborators Apr 3, 2021
@async-graphql async-graphql unlocked this conversation Apr 3, 2021
@async-graphql async-graphql locked and limited conversation to collaborators Apr 3, 2021
@async-graphql async-graphql unlocked this conversation Apr 3, 2021
@sunli829
Copy link
Collaborator

sunli829 commented Apr 3, 2021

Good news, I have made a lot of progress! 🙂

@menixator
Copy link

@sunli829 I really appreciate the hard work you're putting into this project. I'm definitely going to help out when I get my new laptop. 👍

@sunli829
Copy link
Collaborator

sunli829 commented Apr 4, 2021

I have implemented a new extension mechanism on the master branch, which is better than the previous implementation, and I believe this problem has been completely solved. If you have time, please help me test it, thank you. 🙂

@menixator
Copy link

menixator commented Apr 4, 2021

Tested it out with my project. Works exactly as I expect it to, so far. I will keep testing it and will let you know if there are any issues.

@D1plo1d
Copy link
Contributor

D1plo1d commented Apr 4, 2021

async fn execute(&self, ctx: &ExtensionContext<'_>, next: NextExtension<'_>) -> Response;

Hells yes! ^_^

@sunli829 sunli829 closed this as completed Apr 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants