Fix bug in log framer only affecting Windows #3608
Fixes #3342

Two bugs were fixed:

* Closing the StreamFramer's exitCh before setting the error means other goroutines blocked on exitCh closing could see the error as nil. This was *not* observed.
* parseFramerError on Windows would fall through and return an improperly captured nil err variable. There's no need for parseFramerError to be a closure; making it a plain function fixes the confusion.
I hate adding "this should never happen" checks, but causing a tight loop that OOMs Nomad is just too easy in this code otherwise.
Looks good! A couple questions about concurrency and where logic should be for translating the error type.
@@ -466,8 +473,8 @@ func (s *StreamFramer) Send(file, fileEvent string, data []byte, offset int64) e
 // If we are not running, return the error that caused us to not run or
 // indicated that it was never started.
 if !s.running {
-if s.Err != nil {
-return s.Err
+if s.err != nil {
Should this be s.Err()? Otherwise add a comment about why this can be accessed directly.
Same for line 477
This method holds the appropriate lock (L470)
select {
case <-framer.ExitCh():
err := parseFramerErr(framer.Err())
if err == syscall.EPIPE {
Can this logic be in parseFramerErr instead, since it already has logic related to translating errors? I.e., does the caller need to do the translation of syscall.EPIPE -> nil error directly?
It returns the error from parseFramerErr in line 740, but has this extra logic here where it turns that to nil. Good question though - why do we treat the error two different ways?
Reading this again, I am not convinced this block is necessary (saw your comment about defensive code). But if it's here, I would make it return the value of parseFramerErr rather than nil to make handling exit channel events identical in both cases.
This change makes more sense in the actual code and not in the code review I promise. :) The code review cuts out the code 10 lines up that follows similar return behavior when handling the error from s.stream().
The error returned from s.stream() is already passed through parseFramerErr(), then line 924 converts EPIPE to nil in this func.
So this defensive code converts the framer error with the helper func and then follows the same if err == EPIPE { return nil } behavior as above.
Sorry the code is obtuse but the good news is that it only determines whether or not an error is logged when a client disconnects (EPIPE errors are not logged as they're expected).
The important thing in this defensive code is that we return from this infinite for{} loop if the framer has exited. The error code only matters for logging.
 s.l.Lock()
 s.running = false
-s.Err = err
+s.err = err
Should this be set using a SetErr() method that has guards against concurrent access?
The appropriate lock is acquired on L394 and this should be the only place that sets s.err so I don't want to add a helper to encourage other uses.
Does this fix: #2840?
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Fixes #3342
Manually tested on Windows and Linux using the reproducer attached to #3342 on Windows. There's nothing special about the job, though: streaming logs from any job on Windows would cause the issue.
The root cause is that a super subtle misused variable bug in parseFramerError would cause a nil error to be returned after the framer had exited. HTTPServer.logs would see the nil error and think it should continue, but since the framer had exited it would immediately return an error which parseFramerError would again discard and the loop would repeat.
Normally this sort of tight loop would merely consume a single CPU core and be a nuisance. However, this loop spawned a goroutine which would very quickly consume all memory until Nomad itself would OOM.
The 4 commits actually contain distinct fixes:
parseFramerError
👈nomad logs -f
command can make windows clients unresponsive/crash. #3342 another way without this safety check.