Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(image): resolve scan deadlock when error occurs in slow mode #4336

Merged
merged 3 commits into from
May 21, 2023

Conversation

mpoindexter
Copy link
Contributor

@mpoindexter mpoindexter commented May 11, 2023

Description

See #4335

Related issues

Checklist

@mpoindexter mpoindexter requested a review from knqyf263 as a code owner May 11, 2023 19:54
@CLAassistant
Copy link

CLAassistant commented May 11, 2023

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@knqyf263 knqyf263 requested a review from DmitriyLewen May 12, 2023 05:00
@DmitriyLewen
Copy link
Contributor

DmitriyLewen commented May 12, 2023

Hello @mpoindexter
Thanks for your work!

As you said in #4335 i reproduced this case, but i can't find information why semaphore blocks channel. Do you have docs about this case?
Also i found that there is no problem if buffered channel is used (i mean errCh := make(chan error, 1))

@mpoindexter
Copy link
Contributor Author

It's not that the semaphore blocks a channel send directly, the channel send is blocked because it's an unbuffered channel, and nothing is yet reading from the channel. In turn, the goroutine that will eventually be responsible for reading from the channel is blocked because the goroutine trying to send on the channel holds the semaphore which allows at most one holder in slow mode. Using a buffered channel sort of fixes the problem, but for it to be a full fix the buffer size of the channel must be equivalent to max number of errors that could happen, not 1. Otherwise if there are errors on more than one layers the problem can occur.

@DmitriyLewen
Copy link
Contributor

DmitriyLewen commented May 12, 2023

Thanks that you explained in more detail and fast answer!

Copy link
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way looks good for me.
Thanks for your work @mpoindexter .
@knqyf263 I approved this PR.

@knqyf263
Copy link
Collaborator

What if putting limit.Acquire into goroutine? The downside is the number of layers would create a goroutine. But I think it is acceptable since goroutine is lightweight, and the number of layers wouldn't be in the thousands.

@mpoindexter
Copy link
Contributor Author

@knqyf263 I moved the limit.Acquire into goroutine. I think that the select blocks around all the sends from the goroutine are needed since the scan is aborted on first error and hence we need to be able to avoid sending for other goroutines once a single one has errored

@knqyf263
Copy link
Collaborator

We may want to use errgroup in this case.
https://pkg.go.dev/golang.org/x/sync/errgroup

@mpoindexter
Copy link
Contributor Author

OK, updated to use errgroup

Comment on lines +252 to +254
if ctx.Err() != nil {
break
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is acceptable to run goroutine even after an error. What if removing this error check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worse to remove it - there's no correctness problem with removing the check, but inspecting a layer can be quite expensive, so it seems like we should stop doing it if we already know that we're going to get an error result when we call group.Wait()

}()

layerKey := k
ctx := groupCtx
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groupCtx is not updated in the loop. Is there any specific reason to overwrite ctx every time here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it looked cleaner to pass ctx in the body of the goroutine. The code previously took ctx as an argument to the goroutine, but with errgroup we can't pass arguments to the goroutine, so just binding some variables was the replacement. We could just change to use groupCtx within the goroutine, let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trivy can hang if an error occurs scanning a container image with --slow option
4 participants