SxS dll loading fails in host process containers running on containerd 1.7 #366

Open
chris-raitano opened this issue May 23, 2023 · 21 comments

chris-raitano commented May 23, 2023

Describe the bug
My team runs a Ruby script within an HPC (host process container). This works on containerd 1.6, but after upgrading to containerd 1.7, Ruby fails to start with the error message below:

Program 'ruby.exe' failed to run: The application has failed to start because its side-by-side configuration is
incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detailAt
C:\hpc\opt\hostlogswindows\scripts\powershell\main.ps1:63 char:5
    +     & $rubypath ./opt/hostlogswindows/scripts/ruby/tomlparser-hostlog ...
    +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.
At C:\hpc\opt\hostlogswindows\scripts\powershell\main.ps1:63 char:5
    +     & $rubypath ./opt/hostlogswindows/scripts/ruby/tomlparser-hostlog ...
    +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ResourceUnavailable: (:) [], ApplicationFailedException
    + FullyQualifiedErrorId : NativeCommandFailed

The sxstrace.exe tool provides this additional info, which indicates that the manifest isn't found:

Begin Activation Context Generation.
Input Parameter:
        Flags = 0
        ProcessorArchitecture = AMD64
        CultureFallBacks = en-US;en
        ManifestPath = C:\hpc\ruby31\bin\ruby.exe
        AssemblyDirectory = C:\hpc\ruby31\bin\
        Application Config File =
-----------------
INFO: Parsing Manifest File C:\hpc\ruby31\bin\ruby.exe.
        INFO: Manifest Definition Identity is (null).
        INFO: Reference: ruby_builtin_dlls,type="win32",version="1.0.0.0"
INFO: Resolving reference ruby_builtin_dlls,type="win32",version="1.0.0.0".
        INFO: Resolving reference for ProcessorArchitecture ruby_builtin_dlls,type="win32",version="1.0.0.0".
                INFO: Resolving reference for culture Neutral.
                        INFO: Applying Binding Policy.
                                INFO: No binding policy redirect found.
                        INFO: Begin assembly probing.
                                INFO: Did not find the assembly in WinSxS.
                                INFO: Attempt to probe manifest at C:\hpc\ruby31\bin\ruby_builtin_dlls.DLL.
                                INFO: Attempt to probe manifest at C:\hpc\ruby31\bin\ruby_builtin_dlls.MANIFEST.
                                INFO: Attempt to probe manifest at C:\hpc\ruby31\bin\ruby_builtin_dlls\ruby_builtin_dlls.DLL.
                                INFO: Attempt to probe manifest at C:\hpc\ruby31\bin\ruby_builtin_dlls\ruby_builtin_dlls.MANIFEST.
                                INFO: Did not find manifest for culture Neutral.
                        INFO: End assembly probing.
        ERROR: Cannot resolve reference ruby_builtin_dlls,type="win32",version="1.0.0.0".
ERROR: Activation Context generation failed.
End Activation Context Generation.

However, the manifest does exist at the expected location (C:\hpc\ruby31\bin\ruby_builtin_dlls\ruby_builtin_dlls.manifest) with the content below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
      <assemblyIdentity type="win32" name="ruby_builtin_dlls" version="1.0.0.0"></assemblyIdentity>
 
      <file name="libffi-8.dll"/><file name="libgmp-10.dll"/><file name="libwinpthread-1.dll"/><file name="libyaml-0-2.dll"/> 
      <file name="zlib1.dll"/><file name="libcrypto-1_1-x64.dll"/><file name="libgcc_s_seh-1.dll"/><file name="libssl-1_1-x64.dll"/>
    </assembly>
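
As a rough aid for reading traces like the one above, here is a small Python sketch (not part of the original report) that pulls the probed manifest paths and ERROR lines out of parsed sxstrace text and checks whether a given file location was among the probes. The function names are hypothetical, and it assumes each trace entry sits on a single line, so console-wrapped entries would need to be joined first.

```python
import re
from pathlib import PureWindowsPath

# Patterns for the two kinds of trace lines we care about.
PROBE_RE = re.compile(r"INFO: Attempt to probe manifest at (.+?)\.\s*$")
ERROR_RE = re.compile(r"ERROR: (.+?)\s*$")


def summarize_sxstrace(trace_text):
    """Return (probed_paths, errors) found in parsed sxstrace output."""
    probed, errors = [], []
    for line in trace_text.splitlines():
        line = line.strip()
        probe = PROBE_RE.search(line)
        if probe:
            # PureWindowsPath works on any OS and compares case-insensitively.
            probed.append(PureWindowsPath(probe.group(1)))
            continue
        error = ERROR_RE.search(line)
        if error:
            errors.append(error.group(1))
    return probed, errors


def was_probed(probed, path):
    """True if `path` matches one of the probed locations (ignoring case)."""
    target = PureWindowsPath(path)
    return any(candidate == target for candidate in probed)
```

Feeding it the trace from this report and asking `was_probed(probed, r"C:\hpc\ruby31\bin\ruby_builtin_dlls\ruby_builtin_dlls.manifest")` would confirm that the loader did try the location where the manifest exists, which is what makes the failure surprising.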

We are able to run ruby by copying the binaries outside of C:\hpc, but can’t run it inside C:\hpc. (In other words, we can run it inside the container if the container copies the files onto the host filesystem, but we lose the benefits of filesystem isolation)

Since this works both in containerd 1.6 hpc and in containerd 1.7 outside of the c:\hpc directory, my guess is it's likely related to the new bind mount used for the C:\hpc directory.
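
The copy-out workaround described above could look roughly like the following Python sketch. The destination `C:\ruby31-host` and the function names are hypothetical; the report only says the binaries were copied somewhere outside C:\hpc, and this is not a fix for the underlying bind mount behavior.

```python
import shutil
import subprocess
from pathlib import Path


def stage_on_host(src_dir, host_dir):
    """Copy src_dir to host_dir, replacing any earlier copy, and return host_dir."""
    src, dest = Path(src_dir), Path(host_dir)
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(src, dest)
    return dest


def run_ruby_from_host(src_dir=r"C:\hpc\ruby31", host_dir=r"C:\ruby31-host"):
    """Stage Ruby outside the container mount and run `ruby.exe --version` there."""
    ruby = stage_on_host(src_dir, host_dir) / "bin" / "ruby.exe"
    return subprocess.run(
        [str(ruby), "--version"], capture_output=True, text=True, check=False
    )
```

Note that this trades away filesystem isolation, since the staged copy lives on the host and is not cleaned up when the container exits.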

To Reproduce
I've created a lightweight container that we've been able to use to reproduce this.

Dockerfile

FROM mcr.microsoft.com/windows/servercore:ltsc2019

# Install chocolatey
ENV chocolateyVersion 1.4.0
RUN powershell -Command "Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))"
# Install ruby
RUN choco install -y ruby --version 3.1.1.1 --params "'/InstallDir:C:\ruby31'"

COPY main.ps1 /main.ps1

ENTRYPOINT ["powershell", "C:\\hpc\\main.ps1"]

main.ps1

& ./ruby31/bin/ruby.exe --version

while($true){
    Start-Sleep 3600
}

We've deployed it to our Kubernetes cluster with this YAML, replacing <IMAGE> with the built container image from our container registry:

apiVersion: apps/v1
kind: DaemonSet
metadata:
 name: ruby-test
 labels:
  app: ruby-test
spec:
 selector:
  matchLabels:
    app: ruby-test
 template:
  metadata:
    labels:
      app: ruby-test
  spec:
    securityContext:
      windowsOptions:
        hostProcess: true
        runAsUserName: "NT AUTHORITY\\SYSTEM"
    hostNetwork: true
    containers:
     - name: ruby-test
       image: <IMAGE>
       imagePullPolicy: Always
       workingDir: /hpc
    nodeSelector:
      kubernetes.io/os: windows

Expected behavior
Expected behavior is to be able to run executables with side-by-side DLLs inside a host process container.
When running an exe in a container with a working directory inside the container filesystem (C:\hpc), we'd expect the SxS DLL loader to find manifests and load DLLs within the container filesystem.

Configuration:

  • Edition: Windows Server 2019
  • Base Image being used: Windows Server Core 2019
  • Container engine: containerd
  • Container engine version: 1.7

Additional context

chris-raitano added the bug label May 23, 2023
microsoft-github-policy-service bot added the triage label May 23, 2023
@jsturtevant

> We are able to run ruby by copying the binaries outside of C:\hpc, but can’t run it inside C:\hpc. (In other words, we can run it inside the container if the container copies the files onto the host filesystem, but we lose the benefits of filesystem isolation)
>
> Since this works both in containerd 1.6 hpc and in containerd 1.7 outside of the c:\hpc directory, my guess is it's likely related to the new bind mount used for the C:\hpc directory.

Thanks for the really detailed and easy-to-repro steps. The current workaround is to copy this to the host and run it.

This does appear to be due to the bind mount feature. It seems like something is executing from the host context which wouldn't see that binding: https://github.com/microsoft/hcsshim/blob/e7b0eab484b277ab1a30a282b7232744a34e6624/internal/jobcontainers/jobcontainer.go#L334-L341

ntrappe-msft removed the triage label Jun 12, 2023
@fady-azmy-msft

Closing because workaround seems low effort and acceptable.

alrodrig1 commented Jul 3, 2023

> Closing because workaround seems low effort and acceptable.

@fady-azmy-msft,

This workaround is not acceptable long term, only as a temporary measure to unblock us.

This is a serious bug that would impact any user trying to use HostProcess containers.

Is this being tracked somewhere else? If so, please provide a link to the issue so we can follow its resolution.

AbelHu commented Jul 4, 2023

> Closing because workaround seems low effort and acceptable.

@fady-azmy-msft I think this bug may cause downtime for users who upgrade their cluster to containerd 1.7 and hit this issue.

@fady-azmy-msft

ACK. Engineering is looking into this issue, but I don't have a timeline for it.

@microsoft-github-policy-service

This issue has been open for 30 days with no updates.
no assignees, please provide an update or close this issue.

@fschmied
Copy link

fschmied commented Nov 4, 2023

Just FYI, in Azure/AKS#3885, @AbelHu commented on this as "may not be fixed (by design)". (However, this might just mean "at this point".)

This issue has been open for 30 days with no updates.
@kiashok, please provide an update or close this issue.
