This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Request tracing #91

Open: wants to merge 4 commits into base: release/8.8
Conversation

@noblepaul (Collaborator) commented Jul 9, 2021

String reqId = MDC.get(_REQ_ID);
if (reqId == null) return;
if (log.isInfoEnabled()) {
  String type = IS_FORWARDED.get() ? "forwarded" : "external";


Using this, can we identify the top level of a request? (top-level query request vs. subrequest to a shard)

@noblepaul (Collaborator, Author)

Yes.

@hiteshk25

Would it be possible to add a unit test for it?

@hiteshk25

@noblepaul, a follow-up:

  1. Would it be possible to add a test?
  2. Will it have an impact on CPU or memory?

@noblepaul (Collaborator, Author) commented Jul 22, 2021

  1. Would it be possible to add a test?

Done.

  1. Will it have an impact on CPU or memory?

This has very little impact on CPU/memory. My own perf tests showed an extra ~0.5 ms of added latency.

@hiteshk25 commented Jul 26, 2021

Thanks @noblepaul. I will take this forward and merge it!

@patsonluk commented Aug 10, 2021

Hi 👋🏼 , it's nice to see more tracing using a standardized framework like OpenTracing! :)

Since I am very new to the team, I tried to follow the tracing logic and have some thoughts. It's quite likely that I overlooked some logic/concerns, as all of this Solr magic is pretty new to me 😓:

  1. The request start time is logged in https://github.com/fullstorydev/lucene-solr/pull/91/files#diff-9a76a921965533f36983e3171361bf6091f48d2d4683392932a2643433e29e9cR41 , but I cannot seem to find the end-time logging. Is it delegated to some other existing logic?

The rest of the comments are mostly just my opinions; I come from a very different background that uses OT traces quite differently 😊. The current implementation works fine as is, but there are some elements that are a bit confusing to me:

  1. The start operation is logged during the extraction of the span context https://github.com/fullstorydev/lucene-solr/pull/91/files#diff-9a76a921965533f36983e3171361bf6091f48d2d4683392932a2643433e29e9cR106 . While this absolutely works in the current scenario, there is a minor concern that span-context extraction does not always mark a span/scope start; one can, in theory, extract a span context many times without side effects. This is a very minor concern, though; ideally it might be slightly better to log start/end on span creation/close.

  2. The current logic appears to use MDC for status tracking (both writes and reads from tracing logic). This introduces a certain cyclic relationship (the tracer obtained is determined by MDC values, but at the same time the tracer also writes to the MDC). While this is likely a legitimate usage of MDC (since the API allows both reads and writes), I personally find it a bit cleaner if a given actor is only a writer (sets/clears MDC values) OR a reader (reads MDC values), but not both. For example, the tracer is likely a writer to the MDC, and logging frameworks are usually readers from the MDC. One way to replace the current "read" operations on the MDC from the tracer is to find an alternative way to track the _req_id. In fact, the existing logic already uses span-context baggage, which is an ideal target (it propagates across span and process boundaries). Therefore I would advise using only span-context baggage to track _req_id. Using the span context has another advantage: easier propagation across async operations (it still needs extra coding, but OT is designed for it), while using MDC (thread-local storage) might not work in async scenarios.

  3. The current GlobalTracer is stateful per request, based on the trace decision (a request is traced if its header has _req_id OR contains a valid Jaeger context header uberctx-...). This might introduce a cyclic relationship, as the tracer makes the trace decision while the trace decision itself influences which tracer is returned by GlobalTracer.getTracer(). The global tracer works differently in some OT tracer implementations such as Jaeger, and in the newer standard OpenTelemetry: the global tracer (registry) is a simple registry that usually only gets set once at system startup. Such a design assumes the same tracer instance for all request handling; that tracer makes the trace decision, and the decision is stored in the flow via the active scope/span context.
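The baggage-only tracking suggested in point 2 can be sketched in a few lines. This is a simplified, self-contained illustration, not the OpenTracing API: SpanContext and the two helpers below are hypothetical minimal stand-ins, with only the "uberctx-" baggage-header prefix borrowed from Jaeger's convention mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of carrying _req_id in span-context baggage only, so it rides
// along with the context through inject/extract across process boundaries.
// SpanContext and the helpers are hypothetical simplified stand-ins for
// the OpenTracing API; "uberctx-" is Jaeger's baggage-header prefix.
public class BaggageSketch {
    static final String BAGGAGE_PREFIX = "uberctx-";

    static class SpanContext {
        final Map<String, String> baggage = new HashMap<>();
    }

    // inject: write all baggage items into the outgoing request headers
    static void inject(SpanContext ctx, Map<String, String> headers) {
        ctx.baggage.forEach((k, v) -> headers.put(BAGGAGE_PREFIX + k, v));
    }

    // extract: rebuild the context from the headers on the receiving node
    static SpanContext extract(Map<String, String> headers) {
        SpanContext ctx = new SpanContext();
        headers.forEach((k, v) -> {
            if (k.startsWith(BAGGAGE_PREFIX)) {
                ctx.baggage.put(k.substring(BAGGAGE_PREFIX.length()), v);
            }
        });
        return ctx;
    }

    public static void main(String[] args) {
        SpanContext ctx = new SpanContext();
        ctx.baggage.put("_req_id", "abc123");
        Map<String, String> headers = new HashMap<>();
        inject(ctx, headers);
        System.out.println(headers); // prints {uberctx-_req_id=abc123}
        System.out.println(extract(headers).baggage.get("_req_id")); // abc123
    }
}
```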

I understand we do not want to do a complete rewrite, as the current implementation does work fine. I have tried a little refactoring to avoid impact on other existing logic; here is what I tried:

  1. Add a method Span startHttpSpan(HttpServletRequest request) (called from SolrDispatchFilter) that checks two things: 1. whether there is an explicit _req_id in the request header; 2. whether passing the request to the underlying tracer extracts a context. If either yields anything, it is a request we want to trace: use the underlying tracer to build the span; otherwise use the no-op tracer. We can also identify the request type (FORWARD vs. EXTERNAL) here by distinguishing case 1 from case 2.
  2. In startHttpSpan, if there is an explicit _req_id, also set the baggage item _REQ_ID on the span.
  3. In startHttpSpan, use a decorative span class (for example, SolrServletSpan) that simply wraps the span created in step 1 (if a request type can be identified, hence a traced request) and logs a message on span start (ctor) and end (Span#finish). Return the wrapped span to the caller.
  4. Add a method getHttpTracer(Span span) in GlobalTracer that simply returns the underlying tracer if the span is a SolrServletSpan and the no-op tracer otherwise. This is not too pretty, but unfortunately we need it if we don't want to affect other existing code that uses GlobalTracer.getTracer. SolrDispatchFilter can then call GlobalTracer.getHttpTracer(span) and activate the span as a scope like the existing code.
  5. Add a new ScopeManager that sets _req_id in the MDC on scope activation and returns a wrapped Scope that removes _req_id from the MDC when the scope closes. This centralizes all MDC access. Such a scope manager is added to the Jaeger tracer builder via .withScopeManager(new ScopeManager() {...}). This keeps the tracer scope and the MDC context automatically in sync.
  6. Remove all MDC read/write operations elsewhere, since they are all managed by the ScopeManager now. For example, span-context extract/inject should no longer need to set/get from the MDC, as the req_id is always in the span baggage, which is extracted/injected automatically by Jaeger's built-in extractor/injector.
  7. We can remove the class TracerWrapper, as we can use the Jaeger Tracer directly.
  8. We can remove the interface GlobalTracer.SolrTracer.
  9. The ThreadLocal IS_FORWARDED can now be removed, as the request-type info is made directly available and reported in SolrServletSpan.
  10. Change the GlobalTracer.getTracer() logic: instead of using the MDC and a ThreadLocal to keep track of state, use the current active SpanContext to determine whether it should return the underlying tracer or the no-op tracer.
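Point 5 is the heart of this sketch, so here is a minimal, self-contained illustration of the idea. Mdc, Span, and Scope below are tiny hypothetical stand-ins for org.slf4j.MDC and the OpenTracing interfaces, not the real API; an actual implementation would also delegate activation to Jaeger's scope manager.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of proposal 5: a scope manager that copies the request id into
// the logging MDC when a span is activated, and removes it again when the
// returned Scope is closed, centralizing all MDC access in one place.
// Mdc, Span, and Scope are simplified stand-ins, not the real classes.
public class MdcSyncSketch {
    static final String REQ_ID = "_req_id"; // key used in the PR

    // Stand-in for the static org.slf4j.MDC API.
    static class Mdc {
        static final ThreadLocal<Map<String, String>> CTX =
                ThreadLocal.withInitial(HashMap::new);
        static void put(String k, String v) { CTX.get().put(k, v); }
        static String get(String k) { return CTX.get().get(k); }
        static void remove(String k) { CTX.get().remove(k); }
    }

    interface Scope extends AutoCloseable { void close(); }

    // Stand-in for an OpenTracing span carrying baggage.
    static class Span {
        final Map<String, String> baggage = new HashMap<>();
        void setBaggageItem(String k, String v) { baggage.put(k, v); }
        String getBaggageItem(String k) { return baggage.get(k); }
    }

    // On activation, mirror the span's _req_id baggage into the MDC; the
    // wrapped Scope clears it again on close.
    static Scope activate(Span span) {
        String reqId = span.getBaggageItem(REQ_ID);
        if (reqId != null) Mdc.put(REQ_ID, reqId);
        return () -> Mdc.remove(REQ_ID);
    }

    public static void main(String[] args) {
        Span span = new Span();
        span.setBaggageItem(REQ_ID, "abc123");
        try (Scope scope = activate(span)) {
            System.out.println(Mdc.get(REQ_ID)); // prints abc123
        }
        System.out.println(Mdc.get(REQ_ID)); // prints null
    }
}
```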

And for the distant future, if Solr wants to do more tracing, I would definitely recommend giving OpenTelemetry a look, as it offers more tracing capability and is an open standard that is actively being worked on 🎉

OpenTracing and OpenCensus have merged to form OpenTelemetry!

Sorry about my long write-up; I just get a bit too excited about tracing... 😅

@noblepaul (Collaborator, Author) commented Aug 17, 2021

thanks @patsonluk

Hi 👋🏼 , it's nice to see more tracing using a standardized framework like OpenTracing! :)

Actually, we weren’t really looking to embrace the OpenTracing framework fully. The requirement was just to log the request id on each node and use BigQuery to do all the analysis, so this is not a tracing implementation. In that case I could have used the MDC alone to set/get the request id. However, we chose to piggyback on the Jaeger support just to avoid making further changes to Solr for propagating the _req_id in internode requests.

The objectives were:

  • Just log the id on all the nodes
  • Minimize code changes
  • Minimize memory and other overhead
  • Keep the overhead at zero when _req_id is not present

The request start time is logged in https://github.com/fullstorydev/lucene-solr/pull/91/files#diff-9a76a921965533f36983e3171361bf6091f48d2d4683392932a2643433e29e9cR41 , but I cannot seem to find the end-time logging. Is it delegated to some other existing logic?

Yes, Solr already logs the QTime at the end of each request out of the box. This whole exercise was aimed at ensuring the request id is propagated and logged, and that the start of each request is logged. We will need to edit our log4j.xml to add the _req_id to the logs; it is not required to be in the code.
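For illustration, the log4j change could be a one-line pattern edit. The pattern below is hypothetical, not Solr's actual default; the only relevant piece is the %X{_req_id} conversion, which prints that MDC key (and is empty when it is absent):

```xml
<!-- Illustrative log4j2 PatternLayout, assuming the MDC key is _req_id -->
<PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{_req_id}] %c{1.} %m%n"/>
```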

The current logic appears to use MDC for status tracking (both writes and reads from tracing logic). This introduces a certain cyclic relationship (the tracer obtained is determined by MDC values, but at the same time the tracer also writes to the MDC). While this is likely a legitimate usage of MDC (since the API allows both reads and writes), I personally find it a bit cleaner if a given actor is only a writer (sets/clears MDC values) OR a reader (reads MDC values), but not both. For example, the tracer is likely a writer to the MDC, and logging frameworks are usually readers from the MDC. One way to replace the current "read" operations on the MDC from the tracer is to find an alternative way to track the _req_id. In fact, the existing logic already uses span-context baggage, which is an ideal target (it propagates across span and process boundaries). Therefore I would advise using only span-context baggage to track _req_id.

Your observation is correct: we’re using the MDC to set the request ids instead of the span context. We needed the values in the MDC for logging purposes, hence we didn’t take the extra step of populating the span context or reading from it. We weren’t looking for a full-fledged tracing implementation, and we achieved the objective using the MDC alone.

Using the span context has another advantage: easier propagation across async operations (it still needs extra coding, but OT is designed for it), while using MDC (thread-local storage) might not work in async scenarios.

The MDC thread-local storage can propagate the values even for async requests using MDCAwareThreadPools: https://github.com/apache/lucene-solr/blob/branch_8x/solr/solrj/src/java/org/apache/solr/common/util/ExecutorUtil.java#L102
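The snapshot-and-restore trick those MDC-aware pools rely on can be sketched as follows. Mdc and mdcAware here are simplified, hypothetical stand-ins, not ExecutorUtil's actual code: the MDC map is captured on the submitting thread and installed on the worker thread for the duration of the task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the MDC-aware thread pool idea: snapshot the submitting
// thread's MDC and restore it around the task on the worker thread.
// Mdc is a tiny stand-in for org.slf4j.MDC; mdcAware is a simplified
// version of the wrapping ExecutorUtil performs, not its actual code.
public class MdcAwareSketch {
    static class Mdc {
        static final ThreadLocal<Map<String, String>> CTX =
                ThreadLocal.withInitial(HashMap::new);
        static void put(String k, String v) { CTX.get().put(k, v); }
        static String get(String k) { return CTX.get().get(k); }
        static Map<String, String> snapshot() { return new HashMap<>(CTX.get()); }
        static void restore(Map<String, String> m) { CTX.set(new HashMap<>(m)); }
        static void clear() { CTX.remove(); }
    }

    // Capture the MDC at submit time; install it around the task on the worker.
    static Runnable mdcAware(Runnable task) {
        Map<String, String> submitted = Mdc.snapshot();
        return () -> {
            Mdc.restore(submitted);
            try { task.run(); } finally { Mdc.clear(); }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        Mdc.put("_req_id", "abc123");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(mdcAware(() ->
                System.out.println("worker sees " + Mdc.get("_req_id"))));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```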

The current GlobalTracer is stateful per request, based on the trace decision (a request is traced if its header has _req_id OR contains a valid Jaeger context header uberctx-...). This might introduce a cyclic relationship, as the tracer makes the trace decision while the trace decision itself influences which tracer is returned by GlobalTracer.getTracer(). The global tracer works differently in some OT tracer implementations such as Jaeger, and in the newer standard OpenTelemetry: the global tracer (registry) is a simple registry that usually only gets set once at system startup. Such a design assumes the same tracer instance for all request handling; that tracer makes the trace decision, and the decision is stored in the flow via the active scope/span context.

Since Solr 8x is pinned to Jaeger, we assume we can rely on Jaeger’s behaviour for GlobalTracer.getTracer(), and hence we used it. There is no plan yet to move Solr from Jaeger to any other implementation (one that implements GlobalTracer.getTracer() differently), so we think it is safe to assume the behaviour will remain the same for Solr 8x or 9x.

I have tried a little refactoring to avoid impact on other existing logic; here is what I tried:

Do you have a branch/patch where you made these changes?

@patsonluk

Yes, I agree that the current implementation absolutely fulfills the requirements, and likely with fewer changes than using full OpenTracing for tracing right now. :)

We can always come back to this if, in the future, we want to use OpenTracing/OpenTelemetry for fuller tracing. My experimental implementation can be found at https://github.com/fullstorydev/lucene-solr/compare/noble/req_trace...patsonluk:patson/req_trace?expand=1 . It's a pretty rough draft and I have only run the unit-test cases 😓

@noblepaul (Collaborator, Author)

We can always come back to this if in the future we want to use OpenTracing/OpenTelemetry for fuller tracing.

Absolutely. We did not wish to have a complicated and costly implementation.

@hiteshk25

@patsonluk @noblepaul should we close this for now?

@patsonluk

I think it's fine to close it, as the requirement of tracing is fulfilled by using rid for now. Unless we want to trace non-select queries, in which case we might need something like this. @noblepaul, thoughts please? 😊

Labels: None yet
Projects: None yet
3 participants