contenthash: improve the correctness of needsScan #5060

cyphar · 2024-06-19T06:00:20Z

Commit f724d6f ("contenthash: implement proper Linux symlink semantics for needsScan") fixed issues with needScan's handling of symlinks, but the logic used to figure out if a parent path is in the cache was incorrect in a couple of ways:

The optimisation in getFollowSymlinksCallback to avoid looking up / in the cache lead to the callback not being called when we go through /. The upshot is that needsScan(/non-existent-path) would always return true because we didn't check if / has been scanned.
Similarly, the logic of skipping a cache lookup if the path is the same as the current path meant that we skipped the first / lookup (because we start with currentPath=/).
Because needsScan would only store the last good path, cases with symlink jumps to non-existent paths within directories already scanned would result in a re-scan that isn't necessary. Fix this by saving a set of prefix paths we have seen. Note that in combination with (1) and (2), if / has been scanned then needsScan will always return false now (because the / prefix is always checked against).

Fixes: f724d6f ("contenthash: implement proper Linux symlink semantics for needsScan")
Fixes #5042
Signed-off-by: Aleksa Sarai cyphar@cyphar.com

cyphar · 2024-06-19T06:01:57Z

@tonistiigi Based on my testing, this fixes the issue you described as well as having a few other fixes that should improve other possible performance regressions (all mainly dealing with non-existent files).

I can cook up a few tests that call needsScan and scanPath directly if those are the kind of tests you'd like to add? (It's not really possible to test this issue with Checksum directly.)

cyphar · 2024-06-19T06:27:45Z

pathSet is the simplest way of implementing the structure needed for the prefix checks. If you feel a more complex structure is needed, let me know -- but because we do path lookups from left-to-right now, in practice the very early ancestors will be at the head of the array of prefixes so I don't think it needs anything more complicated than a simple array.

tonistiigi

pathSet is the simplest way of implementing the structure needed for the prefix checks. If you feel a more complex structure is needed, let me know

I guess another radix tree would work well for this but assuming the length of prefixes array is always expected to be small, the potentially inefficient lookup in includes shouldn't matter for practical cases.

I can cook up a few tests that call needsScan and scanPath directly if those are the kind of tests you'd like to add?

That sgtm, but we could also add some private counters logic using private variables that can be turned on by the tests so we can see how many times the scanning/walking happens for certain conditions and detect if some future change causes the more expensive scanning part happen more often than expected.

cache/contenthash/checksum.go

cyphar · 2024-06-20T06:01:37Z

I guess another radix tree would work well for this but assuming the length of prefixes array is always expected to be small, the potentially inefficient lookup in includes shouldn't matter for practical cases.

Ah yeah, go-immutable-radix has LongestPrefix which would work for this. But yeah, I think in practice it won't matter (and I'm not sure that it would be better for most cases anyway because this use is not write-few-read-many, it's write-many-read-few, so the copies in each Insert are probably not worth it in practice).

That sgtm, but we could also add some private counters logic using private variables that can be turned on by the tests so we can see how many times the scanning/walking happens for certain conditions and detect if some future change causes the more expensive scanning part happen more often than expected.

Already working on the needsScan tests (hence the draft). I'll add some counters as well while I'm at it, though I suspect that testing needsScan should be sufficient.

Commit f724d6f ("contenthash: implement proper Linux symlink semantics for needsScan") fixed issues with needScan's handling of symlinks, but the logic used to figure out if a parent path is in the cache was incorrect in two ways: 1. The optimisations in getFollowSymlinksCallback to avoid looking up / or no-op components in the cache lead to the callback not being called when we go through / (both in for the initial currentPath=/ state and for absolute symlinks). The upshot is that needsScan(/non-existent-path) would always return true because we didn't check if / has been scanned. There were also some issues with the wrong cache record being returned if you hit a symlink to only /. These optimisations only make sense if are only returning a path (as in rootPath) and not anything else. 2. Because needsScan would only store the _last_ good path, cases with symlink jumps to non-existent paths within directories already scanned would result in a re-scan that isn't necessary. Fix this by saving a set of prefix paths we have seen. This change also means that we can also simplify the logic in the needsScan callback. Note that in combination with (1) and (2), if / has been scanned then needsScan will always return false now (because the / prefix is always checked against, and every path has it as a parent). The pathSet structure is the "dumb" way of doing it. We could use a radix tree and LongestPrefix() to implement it in a smarter way, but in practice the list will be very small and is both short-lived and write-many-read-few, so go-immutable-radix's node copy cost probably makes the array better in practice anyway. Fixes: f724d6f ("contenthash: implement proper Linux symlink semantics for needsScan") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Most of these tests revolve around making sure that we are scanning the correct path during Checksum, and that we don't think a scan is required for subpaths we have already scanned. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

While we now have tests ensuring that needsScan does not regress and indicate that a scan is neccessary, it seems prudent to also include checks that scanPath is definitely not running on any new paths when we don't expect it. Suggested-by: Tonis Tiigi <tonistiigi@gmail.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

cyphar force-pushed the contenthash-scanpath-slash branch from b37ddba to 918c9f9 Compare June 19, 2024 06:17

cyphar force-pushed the contenthash-scanpath-slash branch 2 times, most recently from 3b9491e to 542e4cb Compare June 19, 2024 11:17

cyphar marked this pull request as draft June 20, 2024 04:27

tonistiigi reviewed Jun 20, 2024

View reviewed changes

cache/contenthash/checksum.go Show resolved Hide resolved

cyphar force-pushed the contenthash-scanpath-slash branch 2 times, most recently from 39e3281 to 6dff9ad Compare June 20, 2024 06:56

cyphar marked this pull request as ready for review June 20, 2024 07:00

cyphar force-pushed the contenthash-scanpath-slash branch from 6dff9ad to 9a5ddc0 Compare June 20, 2024 07:12

thompson-shaun added this to the v0.14.2 milestone Jun 20, 2024

cyphar force-pushed the contenthash-scanpath-slash branch 3 times, most recently from a485d4d to 8179763 Compare June 21, 2024 04:39

tonistiigi approved these changes Jun 21, 2024

View reviewed changes

cyphar added 3 commits June 22, 2024 00:47

contenthash: add tests to check needsScan's correctness

0d99599

Most of these tests revolve around making sure that we are scanning the correct path during Checksum, and that we don't think a scan is required for subpaths we have already scanned. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

cyphar force-pushed the contenthash-scanpath-slash branch from 8179763 to 7b630eb Compare June 21, 2024 14:47

tonistiigi merged commit edcf130 into moby:master Jun 21, 2024
76 checks passed

cyphar deleted the contenthash-scanpath-slash branch June 22, 2024 16:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

contenthash: improve the correctness of needsScan #5060

contenthash: improve the correctness of needsScan #5060

cyphar commented Jun 19, 2024

cyphar commented Jun 19, 2024

cyphar commented Jun 19, 2024

tonistiigi left a comment

cyphar commented Jun 20, 2024 •

edited

Loading

contenthash: improve the correctness of needsScan #5060

contenthash: improve the correctness of needsScan #5060

Conversation

cyphar commented Jun 19, 2024

cyphar commented Jun 19, 2024

cyphar commented Jun 19, 2024

tonistiigi left a comment

Choose a reason for hiding this comment

cyphar commented Jun 20, 2024 • edited Loading

cyphar commented Jun 20, 2024 •

edited

Loading