This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Request tracing #91

Open: wants to merge 4 commits into base: release/8.8
Conversation

@noblepaul (Collaborator) commented Jul 9, 2021

String reqId = MDC.get(_REQ_ID);
if (reqId == null) return;
if (log.isInfoEnabled()) {
  String type = IS_FORWARDED.get() ? "forwarded" : "external";


Using this, can we identify the top level of a request? (top-level query request vs. subrequest to a shard)

@noblepaul (Collaborator, Author)

Yes.

@hiteshk25

Would it be possible to add a unit test for it?

@hiteshk25

@noblepaul, a follow-up:

  1. Would it be possible to add a test?
  2. Will it have an impact on CPU or memory?

@noblepaul (Collaborator, Author) commented Jul 22, 2021

  1. Would it be possible to add a test?

Done.

  1. Will it have an impact on CPU or memory?

This has very little impact on CPU/memory. My own perf tests showed an extra ~0.5 ms of added latency.

@hiteshk25 commented Jul 26, 2021

Thanks @noblepaul. I will take this forward and merge it!

@patsonluk commented Aug 10, 2021

Hi 👋🏼 , it's nice to see more tracing using a standardized framework like OpenTracing! :)

Since I am very new to the team, I tried to follow the tracing logic and have some thoughts. It's quite likely that I overlooked some logic/concerns, as all of this Solr magic is pretty new to me 😓:

  1. The request start time is logged in https://github.com/fullstorydev/lucene-solr/pull/91/files#diff-9a76a921965533f36983e3171361bf6091f48d2d4683392932a2643433e29e9cR41 , but I cannot seem to find the end-time logging. Is it delegated to some other existing logic?

The rest of the comments are mostly just my opinions; I come from a very different background that uses OT traces quite differently 😊. The current implementation works fine as is, but there are some elements that are a bit confusing to me:

  1. The start operation is logged during the extraction of the span context https://github.com/fullstorydev/lucene-solr/pull/91/files#diff-9a76a921965533f36983e3171361bf6091f48d2d4683392932a2643433e29e9cR106 . While this absolutely works in the current scenario, there is a minor concern that span-context extraction does not always mark a span/scope start; one can, in theory, extract a span context many times without side effects. This is a very minor concern, though; ideally it might be slightly better to log start/end on span creation/close.

  2. The current logic appears to use MDC for status tracking (both writes and reads from tracing logic). This introduces a certain cyclic relationship (the tracer obtained is determined by MDC values, but at the same time the tracer also writes to the MDC). While this is likely a legitimate usage of MDC (since the API allows both reads and writes), I personally find it a bit cleaner if a given actor is only a writer (sets/clears MDC values) OR a reader (reads MDC values), but not both. For example, the tracer is likely a writer to the MDC, and logging frameworks are usually readers from the MDC. One way to replace the current "read" operations on the MDC from the tracer is to find an alternative way to track the _req_id. In fact, the existing logic already uses span-context baggage, which is an ideal target (it propagates across span and process boundaries). Therefore I would advise using only span-context baggage to track _req_id. Using the span context has another advantage: easier propagation across async operations (it still needs extra coding, but OT is designed for it), while using MDC (thread-local storage) might not work in async scenarios.

  3. The current GlobalTracer is stateful per request, based on the trace decision (a request is traced if its header has _req_id OR contains a valid Jaeger context header uberctx-...). This might introduce a cyclic relationship, as the tracer makes the trace decision while the trace decision itself influences which tracer is returned by GlobalTracer.getTracer(). The global tracer works differently in some OT tracer implementations such as Jaeger, and in the newer standard OpenTelemetry: the global tracer (registry) is a simple registry that usually only gets set once at system startup. Such a design assumes the same tracer instance for all request handling; that tracer makes the trace decision, and the decision is stored in the flow via the active scope/span context.
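The baggage-only tracking suggested in point 2 can be sketched in a few lines. This is a simplified, self-contained illustration, not the OpenTracing API: SpanContext and the two helpers below are hypothetical minimal stand-ins, with only the "uberctx-" baggage-header prefix borrowed from Jaeger's convention mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of carrying _req_id in span-context baggage only, so it rides
// along with the context through inject/extract across process boundaries.
// SpanContext and the helpers are hypothetical simplified stand-ins for
// the OpenTracing API; "uberctx-" is Jaeger's baggage-header prefix.
public class BaggageSketch {
    static final String BAGGAGE_PREFIX = "uberctx-";

    static class SpanContext {
        final Map<String, String> baggage = new HashMap<>();
    }

    // inject: write all baggage items into the outgoing request headers
    static void inject(SpanContext ctx, Map<String, String> headers) {
        ctx.baggage.forEach((k, v) -> headers.put(BAGGAGE_PREFIX + k, v));
    }

    // extract: rebuild the context from the headers on the receiving node
    static SpanContext extract(Map<String, String> headers) {
        SpanContext ctx = new SpanContext();
        headers.forEach((k, v) -> {
            if (k.startsWith(BAGGAGE_PREFIX)) {
                ctx.baggage.put(k.substring(BAGGAGE_PREFIX.length()), v);
            }
        });
        return ctx;
    }

    public static void main(String[] args) {
        SpanContext ctx = new SpanContext();
        ctx.baggage.put("_req_id", "abc123");
        Map<String, String> headers = new HashMap<>();
        inject(ctx, headers);
        System.out.println(headers); // prints {uberctx-_req_id=abc123}
        System.out.println(extract(headers).baggage.get("_req_id")); // abc123
    }
}
```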

I understand we do not want to do a complete rewrite, as the current implementation does work fine. I have tried a little refactoring to avoid impact on other existing logic; here is what I tried:

  1. Add a method Span startHttpSpan(HttpServletRequest request) (called from SolrDispatchFilter) that checks two things: 1. whether there is an explicit _req_id in the request header; 2. whether passing the request to the underlying tracer extracts a context. If either yields anything, it is a request we want to trace: use the underlying tracer to build the span; otherwise use the no-op tracer. We can also identify the request type (FORWARD vs. EXTERNAL) here by distinguishing case 1 from case 2.
  2. In startHttpSpan, if there is an explicit _req_id, also set the baggage item _REQ_ID on the span.
  3. In startHttpSpan, use a decorative span class (for example, SolrServletSpan) that simply wraps the span created in step 1 (if a request type can be identified, hence a traced request) and logs a message on span start (ctor) and end (Span#finish). Return the wrapped span to the caller.
  4. Add a method getHttpTracer(Span span) in GlobalTracer that simply returns the underlying tracer if the span is a SolrServletSpan and the no-op tracer otherwise. This is not too pretty, but unfortunately we need it if we don't want to affect other existing code that uses GlobalTracer.getTracer. SolrDispatchFilter can then call GlobalTracer.getHttpTracer(span) and activate the span as a scope like the existing code.
  5. Add a new ScopeManager that sets _req_id in the MDC on scope activation and returns a wrapped Scope that removes _req_id from the MDC when the scope closes. This centralizes all MDC access. Such a scope manager is added to the Jaeger tracer builder via .withScopeManager(new ScopeManager() {...}). This keeps the tracer scope and the MDC context automatically in sync.
  6. Remove all MDC read/write operations elsewhere, since they are all managed by the ScopeManager now. For example, span-context extract/inject should no longer need to set/get from the MDC, as the req_id is always in the span baggage, which is extracted/injected automatically by Jaeger's built-in extractor/injector.
  7. We can remove the class TracerWrapper, as we can use the Jaeger Tracer directly.
  8. We can remove the interface GlobalTracer.SolrTracer.
  9. The ThreadLocal IS_FORWARDED can now be removed, as the request-type info is made directly available and reported in SolrServletSpan.
  10. Change the GlobalTracer.getTracer() logic: instead of using the MDC and a ThreadLocal to keep track of state, use the current active SpanContext to determine whether it should return the underlying tracer or the no-op tracer.
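Point 5 is the heart of this sketch, so here is a minimal, self-contained illustration of the idea. Mdc, Span, and Scope below are tiny hypothetical stand-ins for org.slf4j.MDC and the OpenTracing interfaces, not the real API; an actual implementation would also delegate activation to Jaeger's scope manager.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of proposal 5: a scope manager that copies the request id into
// the logging MDC when a span is activated, and removes it again when the
// returned Scope is closed, centralizing all MDC access in one place.
// Mdc, Span, and Scope are simplified stand-ins, not the real classes.
public class MdcSyncSketch {
    static final String REQ_ID = "_req_id"; // key used in the PR

    // Stand-in for the static org.slf4j.MDC API.
    static class Mdc {
        static final ThreadLocal<Map<String, String>> CTX =
                ThreadLocal.withInitial(HashMap::new);
        static void put(String k, String v) { CTX.get().put(k, v); }
        static String get(String k) { return CTX.get().get(k); }
        static void remove(String k) { CTX.get().remove(k); }
    }

    interface Scope extends AutoCloseable { void close(); }

    // Stand-in for an OpenTracing span carrying baggage.
    static class Span {
        final Map<String, String> baggage = new HashMap<>();
        void setBaggageItem(String k, String v) { baggage.put(k, v); }
        String getBaggageItem(String k) { return baggage.get(k); }
    }

    // On activation, mirror the span's _req_id baggage into the MDC; the
    // wrapped Scope clears it again on close.
    static Scope activate(Span span) {
        String reqId = span.getBaggageItem(REQ_ID);
        if (reqId != null) Mdc.put(REQ_ID, reqId);
        return () -> Mdc.remove(REQ_ID);
    }

    public static void main(String[] args) {
        Span span = new Span();
        span.setBaggageItem(REQ_ID, "abc123");
        try (Scope scope = activate(span)) {
            System.out.println(Mdc.get(REQ_ID)); // prints abc123
        }
        System.out.println(Mdc.get(REQ_ID)); // prints null
    }
}
```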

And for the distant future, if Solr wants to do more tracing, I would definitely recommend giving OpenTelemetry a look, as it offers more tracing capability and is an open standard that is actively being worked on 🎉

OpenTracing and OpenCensus have merged to form OpenTelemetry!

Sorry about my long write-up; I just get a bit too excited about tracing... 😅

@noblepaul (Collaborator, Author) commented Aug 17, 2021

thanks @patsonluk

Hi 👋🏼 , it's nice to see more tracing using a standardized framework like OpenTracing! :)

Actually, we weren’t really looking to embrace the OpenTracing framework fully. The requirement was just to log the request id on each node and use BigQuery to do all the analysis, so this is not a tracing implementation. In that case I could have used the MDC alone to set/get the request id. However, we chose to piggyback on the Jaeger support just to avoid making further changes to Solr for propagating the _req_id in internode requests.

The objectives were:

  • Just log the id on all the nodes
  • Minimize code changes
  • Minimize memory and other overhead
  • Keep the overhead at zero when _req_id is not present

The request start time is logged in https://github.com/fullstorydev/lucene-solr/pull/91/files#diff-9a76a921965533f36983e3171361bf6091f48d2d4683392932a2643433e29e9cR41 , but I cannot seem to find the end-time logging. Is it delegated to some other existing logic?

Yes, Solr already logs the QTime at the end of each request out of the box. This whole exercise was aimed at ensuring the request id is propagated and logged, and that the start of each request is logged. We will need to edit our log4j.xml to add the _req_id to the logs; it is not required to be in the code.
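For illustration, the log4j change could be a one-line pattern edit. The pattern below is hypothetical, not Solr's actual default; the only relevant piece is the %X{_req_id} conversion, which prints that MDC key (and is empty when it is absent):

```xml
<!-- Illustrative log4j2 PatternLayout, assuming the MDC key is _req_id -->
<PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{_req_id}] %c{1.} %m%n"/>
```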

The current logic appears to use MDC for status tracking (both writes and reads from tracing logic). This introduces a certain cyclic relationship (the tracer obtained is determined by MDC values, but at the same time the tracer also writes to the MDC). While this is likely a legitimate usage of MDC (since the API allows both reads and writes), I personally find it a bit cleaner if a given actor is only a writer (sets/clears MDC values) OR a reader (reads MDC values), but not both. For example, the tracer is likely a writer to the MDC, and logging frameworks are usually readers from the MDC. One way to replace the current "read" operations on the MDC from the tracer is to find an alternative way to track the _req_id. In fact, the existing logic already uses span-context baggage, which is an ideal target (it propagates across span and process boundaries). Therefore I would advise using only span-context baggage to track _req_id.

Your observation is correct: we’re using the MDC to set the request ids instead of the span context. We needed the values in the MDC for logging purposes, hence we didn’t take the extra step of populating the span context or reading from it. We weren’t looking for a full-fledged tracing implementation, and we achieved the objective using the MDC alone.

Using the span context has another advantage: easier propagation across async operations (it still needs extra coding, but OT is designed for it), while using MDC (thread-local storage) might not work in async scenarios.

The MDC thread-local storage can propagate the values even for async requests using MDCAwareThreadPools: https://github.com/apache/lucene-solr/blob/branch_8x/solr/solrj/src/java/org/apache/solr/common/util/ExecutorUtil.java#L102
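The snapshot-and-restore trick those MDC-aware pools rely on can be sketched as follows. Mdc and mdcAware here are simplified, hypothetical stand-ins, not ExecutorUtil's actual code: the MDC map is captured on the submitting thread and installed on the worker thread for the duration of the task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the MDC-aware thread pool idea: snapshot the submitting
// thread's MDC and restore it around the task on the worker thread.
// Mdc is a tiny stand-in for org.slf4j.MDC; mdcAware is a simplified
// version of the wrapping ExecutorUtil performs, not its actual code.
public class MdcAwareSketch {
    static class Mdc {
        static final ThreadLocal<Map<String, String>> CTX =
                ThreadLocal.withInitial(HashMap::new);
        static void put(String k, String v) { CTX.get().put(k, v); }
        static String get(String k) { return CTX.get().get(k); }
        static Map<String, String> snapshot() { return new HashMap<>(CTX.get()); }
        static void restore(Map<String, String> m) { CTX.set(new HashMap<>(m)); }
        static void clear() { CTX.remove(); }
    }

    // Capture the MDC at submit time; install it around the task on the worker.
    static Runnable mdcAware(Runnable task) {
        Map<String, String> submitted = Mdc.snapshot();
        return () -> {
            Mdc.restore(submitted);
            try { task.run(); } finally { Mdc.clear(); }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        Mdc.put("_req_id", "abc123");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(mdcAware(() ->
                System.out.println("worker sees " + Mdc.get("_req_id"))));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```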

The current GlobalTracer is stateful per request, based on the trace decision (a request is traced if its header has _req_id OR contains a valid Jaeger context header uberctx-...). This might introduce a cyclic relationship, as the tracer makes the trace decision while the trace decision itself influences which tracer is returned by GlobalTracer.getTracer(). The global tracer works differently in some OT tracer implementations such as Jaeger, and in the newer standard OpenTelemetry: the global tracer (registry) is a simple registry that usually only gets set once at system startup. Such a design assumes the same tracer instance for all request handling; that tracer makes the trace decision, and the decision is stored in the flow via the active scope/span context.

Since Solr 8x is pinned to Jaeger, we assume we can rely on Jaeger’s behaviour for GlobalTracer.getTracer(), and hence we used it. There is no plan yet to move Solr from Jaeger to any other implementation (one that implements GlobalTracer.getTracer() differently), so we think it is safe to assume the behaviour will remain the same for Solr 8x or 9x.

I have tried a little refactoring to avoid impact on other existing logic; here is what I tried:

Do you have a branch/patch where you made these changes?

@patsonluk

Yes, I agree that the current implementation absolutely fulfills the requirements, and likely with fewer changes than using full OpenTracing for tracing right now. :)

We can always come back to this if, in the future, we want to use OpenTracing/OpenTelemetry for fuller tracing. My experimental implementation can be found at https://github.com/fullstorydev/lucene-solr/compare/noble/req_trace...patsonluk:patson/req_trace?expand=1 . It's a pretty rough draft and I have only run the unit-test cases 😓

@noblepaul (Collaborator, Author)

We can always come back to this if in the future we want to use OpenTracing/OpenTelemetry for fuller tracing.

Absolutely. We did not wish to have a complicated and costly implementation.

@hiteshk25

@patsonluk @noblepaul should we close this for now?

@patsonluk

I think it's fine to close it, as the requirement of tracing is fulfilled by using rid for now. Unless we want to trace non-select queries, in which case we might need something like this. @noblepaul, thoughts please? 😊

Labels: None yet
Projects: None yet
3 participants