Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in log framer only affecting Windows #3608

Merged
merged 5 commits into from
Dec 8, 2017

Conversation

schmichael
Copy link
Member

Fixes #3342

Manually tested on Windows and Linux using the reproducer attached to #3342 on Windows. Although there's nothing special about the job: streaming logs from any job on Windows would cause the issue.

The root cause is that a super subtle misused variable bug in parseFramerError would cause a nil error to be returned to after the framer had exited. HTTPServer.logs would see the nil error and think it should continue, but since the framer had exited it would immediately return an error which parseFramerError would again discard and the loop would repeat.

Normally this sort of tight loop would merely consume a single cpu core and be a nuisance. However this loop spawned a goroutine which would very quickly consume all memory until Nomad itself would OOM.

The 4 commits actually contain distinct fixes:

  • 88cb8bb updates some very old dependencies. I was hoping that would fix it, but it did not. Therefore I'm happy to remove this commit, but I think keeping these deps up to date is ideal.
  • ac3fffc fixes at least 2 races in StreamFramer, however, I did not see any evidence either was hit.
  • ac3fffc also fixes the error handling bug in parseFramerError 👈
  • 13a69bd is purely defensive code that should not be hit. I hate adding code like that and will remove if asked. I added it in this case because the surrounding code is extremely brittle and it would be extremely easy to cause The nomad logs -f command can make windows clients unresponsive/crash. #3342 another way without this safety check.
  • b4bacd6 adds a check for Windows special flavor of "connection closed" error. I looked and looked and found no other way to detect it other than that string comparison. It merely squelches an error from being logged, so the worst case scenario isn't so bad if the string is sometimes wrong (it worked in my testing).

Fixes #3342

Two bugs were fixed:

* Closing the StreamFramer's exitCh before setting the error means other
  goroutines blocked on exitCh closing could see the error as nil. This
  was *not* observered.
* parseFramerError on Windows would fall through and return an
  improperly captured nil err variable. There's no need for
  parseFramerError to be a closure which fixes the confusion.
I hate adding "this should never happen" checks, but causing a tight
loop that OOMs Nomad is just too easy in this code otherwise.
Copy link
Contributor

@chelseakomlo chelseakomlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A couple questions about concurrency and where logic should be for translating the error type.

@@ -466,8 +473,8 @@ func (s *StreamFramer) Send(file, fileEvent string, data []byte, offset int64) e
// If we are not running, return the error that caused us to not run or
// indicated that it was never started.
if !s.running {
if s.Err != nil {
return s.Err
if s.err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be s.Err()? Otherwise add a comment about why this can be accessed directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for line 477

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method holds the appropriate lock (L470)

select {
case <-framer.ExitCh():
err := parseFramerErr(framer.Err())
if err == syscall.EPIPE {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this logic be in parseFramerErr instead, since it already has logic related to translating errors? I.e, does the caller need to do the translation of syscall.EPIPE -> nil error directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it returns the error from parseFramerErr in line 740, but has this extra logic here where it turns that to nil. Good question though - why do we treat the error two different ways?

Copy link
Contributor

@preetapan preetapan Dec 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this again, I am not convinced this block is necessary (saw your comment about defensive code). But if its here, I would make it return the value of parseFramerErr rather than nil to make handling exit channel events identical in both cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes more sense in the actual code and not in the code review I promise. :) The code review cuts out the code 10 lines up that follow similar return behavior when handling the error from s.stream()

The error returned from s.stream() is already passed through parseFramerErr(), then line 924 converts EPIPE to nil in this func.

So this defensive code converts the framer error with the helper func and then follows the same if err == EPIPE { return nil } behavior as above.

Sorry the code is obtuse but the good news is that it only determines whether or not an error is logged when a client disconnects (EPIPE errors are not logged as they're expected).

The important thing in this defensive code is that we return from this infinite for{} loop if the framer has exited. The error code only matters for logging.

s.l.Lock()
s.running = false
s.Err = err
s.err = err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be set using a SetErr() method that has guards against concurrent access?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The appropriate lock is acquired on L394 and this should be the only place that sets s.err so I don't want to add a helper to encourage other uses.

@dadgar
Copy link
Contributor

dadgar commented Dec 5, 2017

Does this fix: #2840?

@schmichael
Copy link
Member Author

Does this fix: #2840?
@dadgar

I'm unsure. It sounds the same, but I could only reproduce this bug on Windows. Only 16 of the 120 goroutines listed in the snippet of output include the telltale blockUntilNextLog function that dominated the vast majority of the goroutines in this case.

@schmichael schmichael merged commit 0921368 into master Dec 8, 2017
@schmichael schmichael deleted the b-3342-windows-log-leak branch December 8, 2017 18:59
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The nomad logs -f command can make windows clients unresponsive/crash.
4 participants