opentelemetry-otlp async pipeline stuck after FramedWrite #473

Closed
clux opened this issue Mar 8, 2021 · 10 comments
Labels
A-trace (Area: issues related to tracing), M-exporter-otlp, release:required-for-stable (Must be resolved before GA release, or nice to have before GA)

Comments

@clux
Contributor

clux commented Mar 8, 2021

The following minimal otlp example with the latest tracing, tracing-subscriber, tracing-opentelemetry, opentelemetry, and opentelemetry-otlp gets stuck inside tonic or h2 when pushing spans in an async context:

use tracing_subscriber::layer::SubscriberExt;

type Error = Box<dyn std::error::Error + Send + Sync + 'static>;
type Result<T> = std::result::Result<T, Error>;

#[tracing::instrument]
async fn trace_test() ->  Result<()> {
    tracing::warn!("start");
    tracing::warn!("end! span: {:?}", tracing::Span::current());
    Ok(())
}

#[tokio::main]
async fn main() -> Result<()> {
    // k port-forward -n monitoring service/grafana-agent-traces 55680:55680
    let otlp_endpoint = std::env::var("OPENTELEMETRY_ENDPOINT_URL").unwrap_or("http://0.0.0.0:55680".to_string());
    let (tracer, _uninstall) = opentelemetry_otlp::new_pipeline()
        .with_endpoint(&otlp_endpoint)
        .install()?;

    // Register all subscribers
    let collector = tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer());

    tracing::subscriber::set_global_default(collector)?;
    let r = trace_test().await?; // THIS HANGS INDEFINITELY
    tracing::info!("trace test returned: {:?}", r);

    Ok(())
}

which prints the following and hangs:

Mar 08 13:05:07.540  WARN trace_test: blah: start
Mar 08 13:05:07.540  WARN trace_test: blah: end! span: Span { name: "trace_test", level: Level(Info), target: "blah", id: Id(1), module_path: "blah", line: 6, file: "blah.rs" }
Mar 08 13:05:07.541 TRACE tower::buffer::service: sending request to buffer worker
Mar 08 13:05:07.541 TRACE tower::buffer::worker: worker polling for next message
Mar 08 13:05:07.541 TRACE tower::buffer::worker: processing new request
Mar 08 13:05:07.541 TRACE tower::buffer::worker: resumed=false worker received request; waiting for service readiness
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; idle
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; connecting
Mar 08 13:05:07.541 TRACE hyper::client::connect::http: Http::connect; scheme=Some("http"), host=Some("0.0.0.0"), port=Some(Port(55680))
Mar 08 13:05:07.541 DEBUG hyper::client::connect::http: connecting to 0.0.0.0:55680
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; not ready
Mar 08 13:05:07.541 TRACE tower::buffer::worker: service.ready=false delay
Mar 08 13:05:07.541 TRACE tower::buffer::worker: worker polling for next message
Mar 08 13:05:07.541 TRACE tower::buffer::worker: resuming buffered request
Mar 08 13:05:07.541 TRACE tower::buffer::worker: resumed=true worker received request; waiting for service readiness
Mar 08 13:05:07.541 TRACE tonic::transport::service::reconnect: poll_ready; connecting
Mar 08 13:05:07.541 DEBUG hyper::client::connect::http: connected to 0.0.0.0:55680
Mar 08 13:05:07.541 TRACE hyper::client::conn: client handshake Http2
Mar 08 13:05:07.541 DEBUG h2::client: binding client connection
Mar 08 13:05:07.542 DEBUG h2::client: client connection bound
Mar 08 13:05:07.542 DEBUG FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding SETTINGS; len=18
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding setting; val=EnablePush(0)
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding setting; val=InitialWindowSize(2097152)
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::frame::settings: encoding setting; val=MaxFrameSize(16384)
Mar 08 13:05:07.542 TRACE FramedWrite::buffer{frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }}: h2::codec::framed_write: encoded settings rem=27

deps:

tokio = { version = "1.2.0", features = ["full"] }
tracing = "0.1.25"
tracing-futures = "0.2.5"
tracing-subscriber = "0.2.16"
tracing-opentelemetry = "0.11.0"
opentelemetry = { version = "0.12.0", features = ["trace"] }
opentelemetry-otlp = { version = "0.5.0", features = ["async"] }

I'm not sure what exactly is going on here, but I have done the following debugging to pinpoint a few configurations that do work:

Bypassing opentelemetry_otlp

The entire pipeline works perfectly without opentelemetry_otlp by replacing it with an opentelemetry stdout pipeline:

diff --git blah.rs blah.rs
index dfb2cd1..529646b 100644
--- blah.rs
+++ blah.rs
@@ -14,9 +14,11 @@ async fn trace_test() ->  Result<()> {
 async fn main() -> Result<()> {
     // k port-forward -n monitoring service/grafana-agent-traces 55680:55680
     let otlp_endpoint = std::env::var("OPENTELEMETRY_ENDPOINT_URL").unwrap_or("http://0.0.0.0:55680".to_string());
-    let (tracer, _uninstall) = opentelemetry_otlp::new_pipeline()
-        .with_endpoint(&otlp_endpoint)
-        .install()?;
+    use opentelemetry::{sdk::export::trace::stdout, trace::Tracer};
+    let (tracer, _uninstall) = stdout::new_pipeline().install();
 
     // Register all subscribers
     let collector = tracing_subscriber::registry()

Spans printed fine (albeit with two logging layers), and the program exits.
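
For reference, here is that stdout variant assembled into a full main (a sketch pieced together from the diff above; trace_test and the Result alias are unchanged from the original example):

#[tokio::main]
async fn main() -> Result<()> {
    // Stdout pipeline from the diff above, in place of the otlp pipeline.
    use opentelemetry::sdk::export::trace::stdout;
    let (tracer, _uninstall) = stdout::new_pipeline().install();

    // Register all subscribers exactly as before.
    let collector = tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .with(tracing_subscriber::fmt::layer());

    tracing::subscriber::set_global_default(collector)?;
    let r = trace_test().await?; // exits normally with the stdout exporter
    tracing::info!("trace test returned: {:?}", r);
    Ok(())
}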

Bypassing a global subscriber

The following works end-to-end using opentelemetry-otlp, and shows that my endpoint (a port-forwarded grafana-agent-traces (Tempo)) is able to receive the data normally (the spans show up on its interface after the program exits):

diff --git blah.rs blah.rs
index dfb2cd1..85ce7b3 100644
--- blah.rs
+++ blah.rs
@@ -4,7 +4,7 @@ type Error = Box<dyn std::error::Error + Send + Sync + 'static>;
 type Result<T> = std::result::Result<T, Error>;
 
 #[tracing::instrument]
-async fn trace_test() ->  Result<()> {
+fn trace_test() ->  Result<()> {
     tracing::warn!("start");
     tracing::warn!("end! span: {:?}", tracing::Span::current());
     Ok(())
@@ -24,15 +24,10 @@ async fn main() -> Result<()> {
         .with(tracing_subscriber::fmt::layer());
 
-    tracing::subscriber::set_global_default(collector)?;
-    let r = trace_test().await?;
-    tracing::info!("trace test returned: {:?}", r);
+    tracing::subscriber::with_default(collector, || {
+        let r = trace_test();
+        tracing::info!("trace test returned: {:?}", r);
+    });
 
     Ok(())
 }

but this only works in non-async contexts as far as I can tell, and looks awkward to integrate into a larger application.
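
(For completeness: tracing-futures, which is already in the dependency list, has a WithSubscriber extension trait that scopes a subscriber to a single future, i.e. the async analogue of with_default. The sketch below is untested against this hang and may well hit the same problem:)

use tracing_futures::WithSubscriber;

// Sketch only: scope the collector to this one future instead of installing it globally.
let r = trace_test().with_subscriber(collector).await?;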

So: am I doing something obviously wrong here? Why does opentelemetry_otlp get stuck in an async context with a global subscriber, yet work perfectly fine in a non-async context using a scoped subscriber?

@clux
Contributor Author

clux commented Mar 8, 2021

Btw have pushed my testing code for this issue to clux/tracing-otlp-test.

It has a few extra things in there relating to fetching trace ids, but they're not relevant to the issue at hand (they just help me debug against Tempo; the outcome is the same).

@clux changed the title from "opentelemetry-oltp async pipeline stuck after FramedWrite" to "opentelemetry-otlp async pipeline stuck after FramedWrite" on Mar 8, 2021
@clux
Contributor Author

clux commented Mar 8, 2021

Ok, more testing. This does not happen with grpc-sys:

-opentelemetry-otlp = { version = "0.5.0", features = ["async", "tokio"] }
+opentelemetry-otlp = { version = "0.5.0", features = ["tokio", "grpc-sys", "openssl"], default-features=false }

This works: it sends spans to Tempo, they are visible there, and the app doesn't hang.
The only extra modification needed was to the passed endpoint URL:

-http://0.0.0.0:55680
+0.0.0.0:55680

clux added a commit to kube-rs/controller-rs that referenced this issue Mar 8, 2021
currently does not work with tonic for some reason: open-telemetry/opentelemetry-rust#473
@TommyCpp
Contributor

TommyCpp commented Mar 9, 2021

Thanks for the detailed report and example! This example also seems to work if the span is not reported via tracing. @jtescher any thoughts?

@jtescher
Member

jtescher commented Mar 9, 2021

Hm, I can't actually reproduce the behavior. Both the code in the description and cloning and running https://github.com/clux/tracing-otlp-test/blob/master/blah.rs seem to run and exit normally. @clux is anything specific needed to show the hanging behavior?

@TommyCpp
Contributor

TommyCpp commented Mar 9, 2021

@jtescher I was able to reproduce the issue after I spun up an otel-collector container and used the example to connect to it.

@jtescher
Member

jtescher commented Mar 9, 2021

Oh, I see the issue now (my problem was that I hadn't bound the ports properly when starting the otlp docker container).

So it looks like the issue is that this configuration is actually trying to trace itself. The cycle goes from the top-level tracing span -> otlp exporter -> tonic -> tonic dependencies starting their own tracing spans, and so it recurses that way and hangs.

The internal tracing spans are at TRACE level, so a simple way to get this working again is to add a tracing subscriber EnvFilter so your subscriber becomes:

let collector = tracing_subscriber::registry()
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .with(tracing_subscriber::EnvFilter::from("DEBUG"))
    .with(tracing_subscriber::fmt::layer());

We can likely come up with a better solution for detecting this case in tracing-opentelemetry, but for now that should get you up and running again.
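
A more targeted variant of the same idea (a sketch; the specific crates silenced here are an assumption, not something prescribed in this thread) keeps application spans at TRACE but turns off the crates the exporter itself drives, which also breaks the cycle:

// Sketch: keep app spans verbose, silence the HTTP/gRPC stack used by the exporter.
// The crate list is an assumption; adjust it to whatever appears in the recursion.
let filter = tracing_subscriber::EnvFilter::new("trace")
    .add_directive("h2=off".parse()?)
    .add_directive("tonic=off".parse()?)
    .add_directive("hyper=off".parse()?)
    .add_directive("tower=off".parse()?);

let collector = tracing_subscriber::registry()
    .with(filter)
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .with(tracing_subscriber::fmt::layer());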

@clux
Contributor Author

clux commented Mar 9, 2021

Ah, a tracing cycle. Interesting. The EnvFilter does indeed fix the minimal example, but it leaves my more complex setup broken.

I suspect this might be a bit more insidious in a bigger application where dependencies are shared.
In particular, my agent (manager) uses tonic deps like hyper (which would be useful to trace), along with a web server.
With both of those running, the EnvFilter solution doesn't seem to work no matter what I do (and the app doesn't even respond to SIGTERM when using tonic).

If you are interested in the bigger example, keep in mind it requires a k8s cluster to talk to plus installing a CRD before running. I'm not sure how much more I can strip out, as it's already an example controller, but I could try to grow the original tracing-otlp-test repo instead so that it doesn't require quite as much setup as controller-rs.

@clux
Contributor Author

clux commented Mar 10, 2021

Just a small update, but no success this time: I tried to make the bug repo re-trigger the bug with an updated EnvFilter, because the hang is still present on controller-rs. I tried a few things, like including a hyper .get inside the traced async fn and spawning a long-running task outside it, but neither triggered the bug, so I haven't pushed anything new there. I guess there's something funky going on with the combination of a long-running task and actix in controller-rs.

Ultimately I had to give up on it for now, and am stuck on grpcio for the time being.

@clux
Contributor Author

clux commented Mar 27, 2021

OK, I have finally managed to make it work with tonic.

I'm still not sure why it was hanging in my more advanced app, but I have at least found out that it's related to the simple exporter (which I probably shouldn't use anyway), and I am able to work around it without switching out tonic.

When I enabled the "tokio-support" feature on opentelemetry (0.12), it stopped hanging with tonic. From what I could gather from the older docs, this seems to force the batch exporter.
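
In Cargo.toml terms, relative to the dependency list at the top of this issue, that change is just:

-opentelemetry = { version = "0.12.0", features = ["trace"] }
+opentelemetry = { version = "0.12.0", features = ["trace", "tokio-support"] }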

@TommyCpp added the A-trace (Area: issues related to tracing) label on Jan 9, 2022
@hdost added this to the Tracing API And SDK Stable milestone on Mar 2, 2023
@hdost added the release:required-for-stable (Must be resolved before GA release, or nice to have before GA) label on Oct 17, 2023
@cijothomas
Member

Closing, as we can track a proper fix via #761.
The OTLP examples now show how to avoid this problem (for logs) by using appropriate filters in the tracing subscriber. Hopefully that is a decent enough mitigation until we can solve the issue with a suppression flag in Context.
