Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some computestorage.FormatWritableLayerVhd breakage in master vs 0.9.2? #1375

Open
TBBle opened this issue Apr 26, 2022 · 21 comments
Open

Some computestorage.FormatWritableLayerVhd breakage in master vs 0.9.2? #1375

TBBle opened this issue Apr 26, 2022 · 21 comments

Comments

@TBBle
Copy link
Contributor

TBBle commented Apr 26, 2022

I was attempting to rebase containerd/containerd#4419 onto #901 to verify it, and this involves upgrading hcsshim from v0.9.2 to master (+my PR).

I noticed a couple of unrelated test failures in containerd, and so I put together a branch that was just the hcsshim upgrade (specific commit in CI) and it produces the same test failures, which are different failures for the same test in Windows Server 2019 and Windows Server 2022.

The failing test is pkg/os TestResolvePath line 153, and on Windows Serve 2019 it reports:

=== FAIL: pkg/os TestResolvePath (0.50s)
    os_windows_test.go:153: failed to format VHD: failed to attach virtual disk: Incorrect function.

while on Windows Server 2022 it reports

=== FAIL: pkg/os TestResolvePath (0.22s)
    os_windows_test.go:153: failed to format VHD: failed to attach virtual disk: The process cannot access the file because it is being used by another process.

My guess is that there's some conflict between the RS5 workaround for disk handle vs VHD handle in containerd and the same workaround in hcsshim master which wasn't present in v0.9.2.

That said, since neither workaround should be active on Windows Server 2022, perhaps there's something else going on.

I haven't tested it (I might quickly do so by removing the workaround from containerd), but perhaps the function should be renamed or have its type changed in some way to highlight to callers that the API behaviour has changed and now always expects the VHD Handle, never the disk handle?

@TBBle TBBle changed the title Some VHD breakage in master vs 0.9.2? Some computestorage.FormatWritableLayerVhd breakage in master vs 0.9.2? Apr 26, 2022
@TBBle
Copy link
Contributor Author

TBBle commented Apr 26, 2022

Okay, that's fun. I tried the obvious fix and now on Windows Server 2019, we get

=== FAIL: pkg/os TestResolvePath (0.42s)
    os_windows_test.go:81: failed to format VHD: failed to attach virtual disk: The process cannot access the file because it is being used by another process.

and on Windows Server 2022 we get

=== FAIL: pkg/os TestResolvePath (0.21s)
    os_windows_test.go:81: failed to format VHD: failed to attach virtual disk: The process cannot access the file because it is being used by another process.

i.e. both fail the way Windows Server 2022 did earlier, which means that something has changed elsewhere. The earlier Incorrect function error was the double-workaround causing the hcsshim workaround to receive a Disk Handle when it expected a VHD Handle to turn into a disk Handle.

On Windows Server 2019, this might be another behaviour change, that the hcshim API expects the VHD to not be attached now, but used to expect the VHD to be attached? Actually, looking closer, this is probably also related to the workaround lowering, since the containerd-level workaround didn't attach the VHD first, because the calling code already did. The comment on computestorage.FormatWritableLayerVhd says "If the VHD is not mounted it will be temporarily mounted." but I suspect right now the actual state is "The VHD must not be mounted, as it will be temporarily mounted", but this might be a factor of the Windows Server 2019 workaround codepath.

Oddly enough, failed to attach virtual disk suggests that it's hitting the osversion.Build() < osversion.V19H1 codepath in both cases, which means perhaps osversion.Build() is coming in too low, which suggests that something's not manifested correctly.

I do note that the containerd version of the workaround explicitly avoided relying on being manifested correctly, so perhaps the issue was that (without my quick-fix) on Server 2019 the workaround was being applied twice instead of once, and on Windows Server 2022 irrespective of my quick-fix, the workaround is being applied once instead of nonce because the containerd unit tests are not manifested so osversion.Build doesn't work.

@dcantah
Copy link
Contributor

dcantah commented Apr 26, 2022

@TBBle It's likely because containerd isn't manifested as you mentioned in the end. I believe we'll always enter the if statement as it'll return windows 8's build num. I think doing the same build number check as we do in the test might make sense for this package to avoid anyone using it to manifest, but maybe the right move is to just manifest containerd like we do for docker. Let me think about this

@TBBle
Copy link
Contributor Author

TBBle commented Apr 26, 2022

On my last point, I also changed the containerd test to not attach the VHD before calling the format API, and now Windows Server 2019 passes, and Windows Server 2022 reports

=== FAIL: pkg/os TestResolvePath (0.30s)
    os_windows_test.go:73: failed to format VHD: failed to format writable layer vhd: Incorrect function.

which makes sense as Windows Server 2022 was expected a VHD Handle, but because this code relies on the manifested caller, and containerd's unit tests are not that, it got a Disk handle through the RS5 code-path.

Also, for the record, we hit FormatWritableLayerVhd in non-test flows in containerd too, but via other hcsshim APIs such as computestorage.SetupContainerBaseLayer. So it's not just the tests.

I'm mildly concerned about hcsshim now requiring its callers to be manifested when they aren't using hcsshim/osversion.Build() directly, but I guess using something like go-winres (c.f. moby/moby#43431) would mean at least we can have the manifest as part of the local build-chain rather than relying on pulling in the hcsshim/test module just for the manifest it uses, as the containerd/integration/client module does.

There was also the recent discussion (I forget if it was here, containerd, or moby...) that perhaps libraries should use the registry to determine the Windows build rather than requiring callers to be manifested to return the right results. I feel like this is more user-friendly when the actual "os version check" interaction is not exposed to the user, as it's easier to not get wrong, particularly if your test coverage is not as wide as containerd's is.

@dcantah
Copy link
Contributor

dcantah commented Apr 26, 2022

I'm mildly concerned about hcsshim now requiring its callers to be manifested when they aren't using hcsshim/osversion.Build() directly,

I'm not ecstatic about this either 😔 I was hoping we could find a way around or converge on changing osversion to not use GetVersion; I hear conflicting opinions internally on what's the safe way to version check. My ideal world is anything that has branching behavior based on version differences and is just a library we expose should grab a version number in a different manner, but that might not be the "right" way.

@dcantah
Copy link
Contributor

dcantah commented Apr 26, 2022

@TBBle I still think the easiest route here is finding what the downsides are of swapping osversion to a non GetVersion approach, but we'll see. Glad this didn't slip into a released tag though thankfully

@TBBle
Copy link
Contributor Author

TBBle commented Apr 27, 2022

Took me a while, but the discussion I was thinking of was containerd/containerd#6697 (comment), for which you already have context. ^_^

@dcantah
Copy link
Contributor

dcantah commented Apr 28, 2022

@TBBle Reached out to some peeps internally to try and close on this, waiting to hear back. Using the build number for feature detection is generally frowned upon but sometimes it's the easiest route.. so it'd be nice to not require manifesting your application because of leakage of the osversion pkg to others in this repo.

@jterry75
Copy link
Contributor

#ugh...

@dcantah
Copy link
Contributor

dcantah commented Apr 28, 2022

#ugh...

Isn't this fun 😌

@dcantah
Copy link
Contributor

dcantah commented May 5, 2022

@TBBle Well the internal vote has been cast and.. manifesting is still the only supported approach 😐 Looks like we'll have to manifest our friend Containerd it seems

@TBBle
Copy link
Contributor Author

TBBle commented May 5, 2022

I opened containerd/containerd#6896 for the containerd side of this discussion.

@TBBle
Copy link
Contributor Author

TBBle commented May 5, 2022

A passing thought: How would we feel about panicking if osversion.Build() sees Windows 8? There was no HCS on Windows 8, so is there ever a time that's a useful return value, and not a "You forgot to manifest the binary" situation?

@dcantah
Copy link
Contributor

dcantah commented Jul 19, 2022

@TBBle Not a bad idea, I was almost thinking for cases like this to make it more flexible on our side for APIs that call it internally, we just have a constant for the build number that will be returned if it's not manifested. So for computestorage we can grab the build number and if it's the non-manifested number, we just take the handle as is and hope the caller handled things. wdyt?

@dcantah
Copy link
Contributor

dcantah commented Jul 19, 2022

e.g.

// Not the final naming 😆
if build := osversion.Build(); build == osversion.PleaseManifest {
   // Don't do anything fancy if we were going to, up to the caller to handle 
  // anything needed on a different os version.
} else if build < osversion.V19H1 {
   // Do stuff
}

@dcantah
Copy link
Contributor

dcantah commented Jul 19, 2022

@jterry75 I'm curious on your thoughts on the above also. We've got some folks out this week but I'm thinking of just handling the computestorage situation for now in the meantime

@dcantah
Copy link
Contributor

dcantah commented Jul 19, 2022

I kinda feel it's bordering on too much to worry about, but needing to do the dance for FormatWritableLayerVhd yourself because of an OS breaking change is also gross. I think ideally we should handle it for you

@jterry75
Copy link
Contributor

Do any of the HCS API's that check for capabilities return the OS version? Could we proxy the call to the HCS basically, get the result, and then act in our code base. Do it in a once call so that after the process knows it's always the same?

@dcantah
Copy link
Contributor

dcantah commented Jul 25, 2022

@jterry75 We might be able to do something with grabbing the current supported schema version, but the internal docs for RS5 for what this returns are a bit odd. It looks like at one point it started off at a lower version and then got bumped (probably from a KB, or the docs are wrong lol). I also don't know if they ever plan to bump schema versions on past OS'. Trying to get an answer here

@jterry75
Copy link
Contributor

Ok, horrid answer, what if we do a once shell out to cmd to get the answer, then cache. So our process incurs only one awful awful hack, but then nobody needs to be manifested?

@dcantah
Copy link
Contributor

dcantah commented Jul 26, 2022

@jterry75 @helsaawy Actually had this idea also 😆 Are you suggesting this for the osversion package in general or just for the computestorage call

@jterry75
Copy link
Contributor

The osversion in general. It will all be in the same process, so if we fix it there we fix it everywhere. And honestly, Windows processes start so fast anyways.... we wont even notice the change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants