Skip to content
This repository has been archived by the owner on May 23, 2024. It is now read-only.

Adding locks where context is accessed #528

Closed
wants to merge 1 commit into from

Conversation

keer25
Copy link

@keer25 keer25 commented Aug 25, 2020

This commit addresses data race issues where span.context is accessed without locks in methods which can be called concurrently with method setBaggageItem which modifies context

Resolves issue #526

Signed-off-by: Keerthana Selvakumar keerukeerthana8@gmail.com

Which problem is this PR solving?

  • This commit addresses data race issues where span.context is accessed without locks in methods which can be called concurrently with method setBaggageItem which modifies context

Short description of the changes

  • Called locks to fix data races

@keer25 keer25 force-pushed the master branch 2 times, most recently from 665f930 to 255ce9d Compare August 25, 2020 21:17
@codecov
Copy link

codecov bot commented Aug 25, 2020

Codecov Report

Merging #528 into master will increase coverage by 0.02%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #528      +/-   ##
==========================================
+ Coverage   89.28%   89.30%   +0.02%     
==========================================
  Files          61       61              
  Lines        3919     3918       -1     
==========================================
  Hits         3499     3499              
+ Misses        294      292       -2     
- Partials      126      127       +1     
Impacted Files Coverage Δ
tracer.go 95.91% <50.00%> (ø)
span.go 96.99% <100.00%> (-0.02%) ⬇️
utils/reconnecting_udp_conn.go 90.00% <0.00%> (-2.50%) ⬇️
internal/baggage/remote/restriction_manager.go 100.00% <0.00%> (+4.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf8927b...f2804f4. Read the comment docs.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joe-elliott you might be interested in reviewing this.

@@ -427,10 +427,10 @@ func (s *Span) serviceName() string {

func (s *Span) applySamplingDecision(decision SamplingDecision, lock bool) {
if !decision.Retryable {
s.context.samplingState.setFinal()
s.SpanContext().samplingState.setFinal()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The presence of the lock parameter in this function indicates that it might be called while the lock on the span is already acquired by the current goroutine. What would happen in that case if we call SpanContext() that will try to acquire a read-lock? I remember in the past that in Go the RWMutex is not re-entrable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah rwmutex is not re-entrable, but so far this is not called in places where the caller already obtains a lock, the method is thread-safe in itself, so the caller shouldn't need to obtain locks, added documentation for that. lock param is set to false only in the start span method where too no locks are obtained.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think it's dangerous to ignore the lock parameter even if right now there doesn't happen to be a codepath that deadlocks. Perhaps access s.context directly here and move the if lock code to the top?

Alternatively the way it's been changed we're always locking and we've already taken the performance hit. We could just remove the lock param. Depending on benchmarks this may be an option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is true, I started making changes that way to apply lock while calling all these methods, but this method also calls sampler and observer callbacks like sampler.OnSetOperationName(), etc., which won't be in the same package and have to call with locks, and there is high risk of these calling locks than callers to these internal methods.

We can call locks around these call backs, the code is going to lock and unlock twice at-least in each method, one before and after sampler callbacks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func (s *Span) setTagInternal(key string, value interface{}, lock bool) opentracing.Span {
	s.observer.OnSetTag(key, value)
	if lock {
		s.RLock()
	}
	if key == string(ext.SamplingPriority) && !setSamplingPriority(s, value) {
		return s
	}
	if !s.isSamplingFinalized() {
		if lock {
			s.RUnlock()
		}
		decision := s.tracer.sampler.OnSetTag(s, key, value)
		if lock {
			s.Lock()
			defer s.Unlock()
		}
		s.applySamplingDecision(decision, lock)
	} else {
		if lock {
			s.Lock()
			defer s.Unlock()
		}
	}
	if s.isWriteable() {
		s.appendTagNoLocking(key, value)
	}
	return s
}

Copy link
Author

@keer25 keer25 Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how some three methods in span.go would look, if this looks good we can go with this.
As a rule not a good idea to hold a lock while calling external methods (like sampler callbacks), but internal methods we can document if it is to be called with or without locks. And I believe the lock parameter was just so locks are not called when we need not bother about concurrency like when the span is started.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some benchmarks can tell which approach is better, I will add some micro benchmarks over the weekend

span.go Outdated Show resolved Hide resolved
span.go Outdated Show resolved Hide resolved
@@ -363,7 +363,6 @@ func TestSpan_References(t *testing.T) {
}

func TestSpanContextRaces(t *testing.T) {
t.Skip("Skipped: test will panic with -race, see https://github.com/jaegertracing/jaeger-client-go/issues/526")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to add a few more mutators to this test, namely those that affect samplingState and flags.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah added a few, and also called span.Finish in the end, and found a few more data race conditions in the reportSpan() method and modified it

This commit addresses data race issues where span.context is accessed without locks in methods which can be called concurrently with method setBaggageItem which modifies context

Per issue jaegertracing#526

Signed-off-by: Keerthana Selvakumar <keerukeerthana8@gmail.com>
@@ -427,10 +427,10 @@ func (s *Span) serviceName() string {

func (s *Span) applySamplingDecision(decision SamplingDecision, lock bool) {
if !decision.Retryable {
s.context.samplingState.setFinal()
s.SpanContext().samplingState.setFinal()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think it's dangerous to ignore the lock parameter even if right now there doesn't happen to be a codepath that deadlocks. Perhaps access s.context directly here and move the if lock code to the top?

Alternatively the way it's been changed we're always locking and we've already taken the performance hit. We could just remove the lock param. Depending on benchmarks this may be an option.

@@ -120,8 +123,8 @@ func (s *Span) setTagInternal(key string, value interface{}, lock bool) opentrac

// SpanContext returns span context
func (s *Span) SpanContext() SpanContext {
s.Lock()
defer s.Unlock()
s.RLock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of other places that hold .Lock() when only .RLock() is needed. Seeing .StartTime(), .Duration(), .Tags() for instance. Probably not in the scope of this fix though.

})
go accessor(func() {
span.SpanContext().samplingState.setFlag(flagDebug)
})
Copy link
Member

@joe-elliott joe-elliott Aug 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe reset still races, but unsure if this is worth protecting. This is only called when the syncpool is enabled and generally the lifecycle code seems dangerous. It would be trivial for the calling application to hold a *Span after .Finish() and do terrible things with it so I think we have to trust the calling code.

Added a comment to the main thread with some final thoughts.

@@ -439,7 +439,7 @@ func (t *Tracer) emitNewSpanMetrics(sp *Span, newTrace bool) {
func (t *Tracer) reportSpan(sp *Span) {
if !sp.isSamplingFinalized() {
t.metrics.SpansFinishedDelayedSampling.Inc(1)
} else if sp.context.IsSampled() {
} else if sp.SpanContext().IsSampled() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store the value of SpanContext().IsSampled() at the top of the method to only call SpanContext() once?

@@ -445,12 +451,12 @@ func (s *Span) applySamplingDecision(decision SamplingDecision, lock bool) {

// Span can be written to if it is sampled or the sampling decision has not been finalized.
func (s *Span) isWriteable() bool {
state := s.context.samplingState
state := s.SpanContext().samplingState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isWriteable and isSamplingFinalized are called beneath setTagInternal which receives a lock param that is ignored here.

@joe-elliott
Copy link
Member

joe-elliott commented Aug 27, 2020

So I've spent some time thinking about this and the following makes sense to me.

  • Change isWriteable(), isSamplingFinalized() and setSamplingPriority() to take a SpanContext. These functions will no longer have to care about locking b/c we consider the SpanContext immutable. Then the calling code can make a choice to either use .SpanContext() or .context depending on whether or not it should or should not lock.
    • isSamplingFinalized is the only method called in multiple places and I think its straightforward in each place if locking is needed or not.
  • Make applySamplingDecision and setTagInternal obey their lock parameters (including Context access). I think this can be done relatively cleanly with something like this at the top. Then ctx can just be passed to all the helper methods. (Good call on not making a callback under lock.)
    var ctx SpanContext
    if lock {
      ctx = span.SpanContext()
    } else {
      ctx = span.context
    }
    
  • FinishWithOptions -> tracer.reportSpan() -> reset() does need to be protected.
  • Add reset() and SetOperationName() to TestSpanContextRaces()

@yurishkuro does this make sense?

@yurishkuro
Copy link
Member

@joe-elliott

Change isWriteable(), isSamplingFinalized() and setSamplingPriority() to take a SpanContext.

This bothers me a little. The first two are actually derived from the settings stored in the SpanContext, so passing them the SpanContext seems backwards, could we just call them on the SpanContext directly?

setSamplingPriority - I'd instead pass the samplingState than SpanContext.

I think doing it this way reduces the coupling.

@joe-elliott
Copy link
Member

This bothers me a little. The first two are actually derived from the settings stored in the SpanContext, so passing them the SpanContext seems backwards, could we just call them on the SpanContext directly?

I see no issue with having a SpanContext as a receiver instead of taking one as a parameter. It's kind of a stylistic choice and I was considering suggesting it as well. 👍

setSamplingPriority - I'd instead pass the samplingState than SpanContext.

Agree.

@joe-elliott
Copy link
Member

@keer25 are you still working on this issue? if not, i would like to take it up.

@yurishkuro
Copy link
Member

Addressed differently in #544. But thank you @keer25 for the first attempt, it informed the final solution.

@yurishkuro yurishkuro closed this Oct 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants