Create SDK TRACE context, a shared reference to tracer pipelines between Tracer/TracerProvider. Update tests. #590

Closed · wants to merge 16 commits

Conversation

@jsuereth (Contributor) commented Feb 26, 2021

Precursor to #502

Changes

  • Create TracerContext for storing the Sampler, Resource and SpanProcessor (a rough sketch of the resulting shape follows this list).
  • Update Tracer and TracerProvider to use a shared reference to TracerContext.
  • Update Span to no longer store two shared pointers to the same content (Tracer + Processor). Instead, Span now pulls the Processor from its Tracer. This is preparation for pulling InstrumentationLibrary.
  • TracerProvider now uniquely owns a SpanProcessor (to ensure an appropriate lifecycle for flush/shutdown).
  • Update the zpages extension to share data through a specific object + shared_ptr rather than directly sharing a SpanProcessor instance.
  • Update SpanProcessor to take a sdk::trace::Span reference instead of a Recordable directly. Note: Span already remembers the Tracer it came from, which (will) include InstrumentationLibrary and TracerContext.
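
For orientation, here is a simplified sketch of the ownership layout described above. The class shapes and method names below are approximations for illustration, not the exact SDK headers from this PR:

    // Simplified stand-ins: TracerContext owns the pipeline pieces, the
    // TracerProvider and every Tracer share one TracerContext, and the
    // processor is uniquely owned so flush/shutdown has a single owner.
    #include <memory>
    #include <utility>

    class SpanProcessor { /* OnStart / OnEnd / ForceFlush / Shutdown ... */ };
    class Sampler {};
    class Resource {};

    class TracerContext
    {
    public:
      TracerContext(std::unique_ptr<SpanProcessor> processor,
                    std::unique_ptr<Sampler> sampler,
                    Resource resource)
          : processor_(std::move(processor)),
            sampler_(std::move(sampler)),
            resource_(std::move(resource))
      {}

      SpanProcessor &GetActiveProcessor() noexcept { return *processor_; }
      Sampler &GetSampler() noexcept { return *sampler_; }
      const Resource &GetResource() const noexcept { return resource_; }

    private:
      std::unique_ptr<SpanProcessor> processor_;  // unique ownership: flush/shutdown lifecycle lives here
      std::unique_ptr<Sampler> sampler_;
      Resource resource_;
    };

    class Tracer
    {
    public:
      explicit Tracer(std::shared_ptr<TracerContext> context) : context_(std::move(context)) {}
      // Spans pull the processor through their Tracer instead of holding their own copy.
      SpanProcessor &GetProcessor() noexcept { return context_->GetActiveProcessor(); }

    private:
      std::shared_ptr<TracerContext> context_;  // shared with the TracerProvider
    };

    class TracerProvider
    {
    public:
      explicit TracerProvider(std::shared_ptr<TracerContext> context) : context_(std::move(context)) {}
      std::shared_ptr<Tracer> GetTracer() { return std::make_shared<Tracer>(context_); }

    private:
      std::shared_ptr<TracerContext> context_;
    };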

codecov bot commented Feb 26, 2021

Codecov Report

Merging #590 (51a48d5) into main (3d71e5e) will decrease coverage by 0.01%.
The diff coverage is 87.17%.

❗ Current head 51a48d5 differs from pull request most recent head 0356183. Consider uploading reports for the commit 0356183 to get more accurate results.

@@            Coverage Diff             @@
##             main     #590      +/-   ##
==========================================
- Coverage   94.40%   94.38%   -0.02%     
==========================================
  Files         191      192       +1     
  Lines        9023     9036      +13     
==========================================
+ Hits         8518     8529      +11     
- Misses        505      507       +2     
Impacted Files Coverage Δ
api/include/opentelemetry/trace/span_context.h 100.00% <ø> (ø)
api/include/opentelemetry/trace/trace_state.h 97.54% <ø> (-0.02%) ⬇️
sdk/src/trace/span.h 100.00% <ø> (ø)
sdk/src/trace/tracer_context.cc 61.53% <61.53%> (ø)
sdk/src/trace/tracer_provider.cc 80.95% <75.00%> (+12.77%) ⬆️
sdk/src/trace/tracer.cc 71.87% <80.00%> (-2.42%) ⬇️
ext/test/zpages/tracez_data_aggregator_test.cc 96.35% <100.00%> (+0.01%) ⬆️
ext/test/zpages/tracez_processor_test.cc 98.70% <100.00%> (+<0.01%) ⬆️
...clude/opentelemetry/sdk/common/atomic_unique_ptr.h 100.00% <100.00%> (ø)
sdk/src/trace/span.cc 89.47% <100.00%> (ø)
... and 3 more

@jsuereth jsuereth marked this pull request as ready for review February 27, 2021 15:57
@jsuereth jsuereth requested a review from a team February 27, 2021 15:57
sdk/include/opentelemetry/sdk/trace/tracer_context.h (outdated review thread)
*/
const opentelemetry::sdk::resource::Resource &GetResource() const noexcept;

void ForceFlushWithMicroseconds(uint64_t timeout) noexcept;
Contributor

Do we need this?

Contributor Author

We need a way to force-flush spans in some ephemeral compute environments. There have been some additions to the spec around ForceFlush, and I think we need this to support that version of the spec.
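
As a rough illustration of that use case: the stand-in class, handler, and 100 ms budget below are assumptions for the sketch; only the ForceFlushWithMicroseconds signature comes from the header under review.

    #include <cstdint>

    // Illustration only: a stand-in TracerContext exposing the method under
    // discussion, and a FaaS-style handler that flushes before returning.
    class TracerContext
    {
    public:
      void ForceFlushWithMicroseconds(std::uint64_t timeout) noexcept
      {
        // In the SDK this would delegate to the owned SpanProcessor.
        (void)timeout;
      }
    };

    void on_invocation_end(TracerContext &context)
    {
      // ... spans for this invocation have been created and ended ...
      context.ForceFlushWithMicroseconds(100000);  // allow up to 100 ms to drain
    }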

sdk/include/opentelemetry/sdk/trace/tracer_context.h (outdated review thread)
@seemk (Contributor) commented Mar 13, 2021

  • TracerProvider now uniquely owns a SpanProcessor (to ensure appropriate lifecycle of flush/shutdown)

Does this mean that, in the case of multiple tracer providers, there will be multiple span processors and thus multiple exporters, each processor having its own queue to flush? In the case of nginx I'll most likely need several Resources to specify service names for different virtual hosts / domains (or arbitrary route-based service names), and in order to achieve that I need multiple tracer providers. The multiple processors/queues most likely won't become an issue, but each batch span processor creates its own thread, so an nginx instance with multiple virtual domains might end up with a lot of threads. That is a bit concerning, but might not be a problem in reality.

*
* TODO(jsuereth): This method will be reworked once multi-processor span support is added.
*/
std::unique_ptr<Recordable> ConsumeRecordable() {
@lalitb (Member) Mar 15, 2021

@jsuereth - This is something I am not able to understand. How will this method be modified/replaced to get the correct Recordable from the span object once we have multi-processor span support?

Contributor Author

Sorry for the long delay. I'm ~60% through adding multi-processor support to show this (and prove it works). You're right, I had to remove this method.

@lalitb (Member) commented Mar 15, 2021

but might not be a problem in reality.

I have observed typical web-hosting companies creating tens of thousands of (dynamic/static) vhost configurations to be handled by a single nginx/apache server :)

@jsuereth (Contributor Author)

@seemk From a Resource standpoint, nginx would be the service providing telemetry. I'm not sure we've specified (well) how to handle virtual domains, etc., but in this case I think the "spirit" of Resource is that you're reporting for the nginx module hosting all of those domains. You'd use labels to identify which virtual host each trace is associated with. This is likely a bigger discussion than fits in this CL.
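
For illustration, a minimal sketch of that suggestion: a single Resource identifying the nginx service, with per-span attributes for the virtual host. The attribute keys and request handler below are assumptions for the sketch, not part of this PR:

    // Sketch: one provider/Resource for the whole nginx process; each span is
    // tagged with the virtual host it served. Attribute keys are illustrative.
    #include "opentelemetry/nostd/string_view.h"
    #include "opentelemetry/trace/provider.h"

    namespace nostd = opentelemetry::nostd;
    namespace trace = opentelemetry::trace;

    void on_request(nostd::string_view virtual_host, nostd::string_view route)
    {
      auto tracer = trace::Provider::GetTracerProvider()->GetTracer("nginx");

      auto span = tracer->StartSpan("handle_request");
      span->SetAttribute("http.host", virtual_host);  // which vhost served the request
      span->SetAttribute("http.route", route);
      // ... proxy / serve the request ...
      span->End();
    }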

Additionally, regarding thread pool and resource constraints, I agree that the batch span processor consuming a thread could be an issue. I think that's worth opening a bug for. We might need to adapt BatchProcessor to use some kind of shared-thread / event-queue model where you can limit the number of threads used by the entire SDK. I don't think we should try to limit the number of TracerProviders or force a single batch span processor today, as that's antithetical to some of the design ideas. I do agree the current incarnation of the BSP needs some work.

Again, apologies for my slow progress :) Looking forward to chatting over things more in our next SIG. I should have a working multi-processor implementation by then, and we can figure out where to go next, given the current state + spec.

@lalitb (Member) commented Mar 22, 2021

We might need to adapt BatchProcessor to use some kind of shared-thread / event-queue model where you can limit the number of threads used by the entire SDK.

There was some work done initially to support async I/O using an event loop (#63) - not sure if we can use this to get rid of the thread model for the batch processor.

@seemk (Contributor) commented Mar 23, 2021

@seemk From a Resource standpoint, nginx would be the service providing telemetry. [...] You'd use labels to identify which virtual host each trace is associated with. [...] I agree that the batch span processor consuming a thread could be an issue. [...] I do agree the current incarnation of the BSP needs some work.

Thanks! Yeah, I wasn't sure myself how virtual domain handling should happen inside nginx, and whether creating multiple resources would be the way to go. I do like your idea of a single resource with custom attributes. In any case, I guess the multiple threads won't be an issue for now 🙂

- Migrate SpanProcessor to use `ExportableSpan`
- `ExportableSpan` has a registry of `Recordable`s donated from processors
- `ExportableSpan` replaces `Recordable` in `Span` implementation for now
- Do some gymnastics around unique_ptr + ownership
- Update SDK tests (exporters/ext tests still borked)
- For now, `ExportableSpan` has a shared_ptr reference to the originating Tracer.  TBD on whether this stays.
@pyohannes (Contributor)

It would be great to split this PR up; that would make reviewing much easier.

Regarding the ExportableSpan abstraction, I'd be curious to hear your opinion about a different approach.

  1. Have a MultiSpanProcessor, like you sketched it out. Internally processors are stored in a map, mapping processor addresses to processors.

    std::vector<std::unique_ptr<SpanProcessor>> processors;
    processors.push_back(std::move(zipkin_processor));
    processors.push_back(std::move(otel_processor));
    auto processor = new MultiSpanProcessor(std::move(processors), options);
  2. Have a type MultiRecordable, which implements the Recordable interface, and encapsulates several regular recordables. The MultiSpanProcessor would initialize it like this:

    std::unique_ptr<Recordable> MultiSpanProcessor::MakeRecordable() noexcept
    {
      auto recordable = std::unique_ptr<MultiRecordable>(new MultiRecordable);
      for (auto &processor : processors_) {
        recordable->Add(processor.second->MakeRecordable(), processor.second.get());
      }
      return recordable;
    }
  3. All the setters on MultiRecordable would fan out to all the encapsulated recordables:

    void MultiRecordable::SetStatus(opentelemetry::trace::StatusCode code,
                    nostd::string_view description) noexcept
    {
        for (auto& recordable : recordables_) 
        {
             recordable.second->SetStatus(code, description);
        }
    }
  4. At calls to OnStart and OnEnd, the MultiSpanProcessor fans out to all processors:

    void MultiSpanProcessor::OnEnd(std::unique_ptr<Recordable> &&recordable) noexcept
    {
      auto multi_recordable = static_cast<MultiRecordable *>(recordable.get());
      auto recordables      = multi_recordable->Release();

      for (auto &r : recordables)
      {
        auto processor = processors_.find(r.GetProcessorPtr());
        if (processor != processors_.end())
        {
          // hand each per-processor recordable back to the processor that created it
          processor->second->OnEnd(r.ReleaseRecordable());
        }
      }
    }

I don't see many big benefits in having a separate ExportableSpan abstraction; it might even be confusing for many users (the recordable approach by itself is already complicated enough).

@jsuereth (Contributor Author)

@pyohannes I tried to take an approach just like that. In fact, I even called it "MultiRecordable". TL;DR: OnStart broke it for me.

The primary issue is that you need to support these things:

  1. Span needs to own the recordable
  2. OnStart callback needs a reference to the appropriate recordable.
  3. OnEnd needs to take back ownership of the recordable
  4. Recordable is generated by either Processor or Exporter
  5. Recordable does not attach Resource or InstrumentationLibrary to Spans. ExportableSpan was my place to hang this.

#2 and #5 caused issues in the MultiRecordable design, so I went this direction.
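
For reference, a rough approximation of the shape ExportableSpan takes under those constraints. The types here are stand-ins and the member names are approximations, not the final API in this PR:

    // Rough approximation only: ExportableSpan owns one Recordable per
    // processor and remembers the Tracer it came from, so Resource and
    // InstrumentationLibrary can be attached at export time.
    #include <map>
    #include <memory>
    #include <utility>

    class Recordable
    {
    public:
      virtual ~Recordable() = default;
    };
    class SpanProcessor {};
    class Tracer {};

    class ExportableSpan
    {
    public:
      explicit ExportableSpan(std::shared_ptr<Tracer> tracer) : tracer_(std::move(tracer)) {}

      // Each processor registers the recordable it created (requirement 4).
      void RegisterRecordableFor(const SpanProcessor &processor, std::unique_ptr<Recordable> recordable)
      {
        recordables_[&processor] = std::move(recordable);
      }

      // OnStart only needs a reference to "its" recordable (requirement 2).
      Recordable *GetRecordableFor(const SpanProcessor &processor)
      {
        auto it = recordables_.find(&processor);
        return it != recordables_.end() ? it->second.get() : nullptr;
      }

      // OnEnd takes ownership back, per processor (requirement 3).
      std::unique_ptr<Recordable> ReleaseRecordableFor(const SpanProcessor &processor)
      {
        auto it = recordables_.find(&processor);
        if (it == recordables_.end())
          return nullptr;
        auto released = std::move(it->second);
        recordables_.erase(it);
        return released;
      }

      // Resource / InstrumentationLibrary hang off the originating Tracer (requirement 5).
      const Tracer &GetTracer() const { return *tracer_; }

    private:
      std::shared_ptr<Tracer> tracer_;
      std::map<const SpanProcessor *, std::unique_ptr<Recordable>> recordables_;
    };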

I've started fragmenting out the PR, so I'll send a multi-processor focused PR where we can hash out ideas.

@lalitb (Member) commented Apr 2, 2021

I've started fragmenting out the PR, so I'll send a multi-processor focused PR where we can hash out ideas.

@jsuereth - Hope it is fine to hold review comments for this PR, as you will be splitting it into separate PRs - TracerContext and multi-processor. Just want to be sure you are not waiting for review comments here :)

@jsuereth (Contributor Author) commented Apr 2, 2021

@lalitb No, I'm still splitting this up. Ran into some illness that ruined my whole week, should see something from me this weekend/early next week.

@lalitb (Member) commented Apr 3, 2021

@lalitb No, I'm still splitting this up. Ran into some illness that ruined my whole week, should see something from me this weekend/early next week.

No issues, @jsuereth. This was just to ensure you are not waiting for comments on this PR. Take care of yourself.
