[rawhide] Cannot SSH into node #1775

Closed
aaradhak opened this issue Aug 12, 2024 · 13 comments
Labels: jira, kind/bug, pipeline failure

Comments

@aaradhak
Member

aaradhak commented Aug 12, 2024

jlebon: see #1775 (comment) and following. This is unlikely to be an SELinux issue.

Original report follows.


Describe the bug

The kola basic.uefi test is failing with AVC denials when using the selinux-policy-41.13-1.fc41 RPM. The test fails with the following error:

[coreos-assembler]$ kola run basic.uefi
⏭  Skipping kola test pattern "fcos.internet":
  👉 https://github.com/coreos/coreos-assembler/pull/1478
⏭  Skipping kola test pattern "podman.workflow":
  👉 https://github.com/coreos/coreos-assembler/pull/1478
=== RUN   basic.uefi
--- FAIL: basic.uefi (611.40s)
        core.go:163: machine "8ab3f07b-a278-4ac7-9f28-48ae58a45dd5" failed to start: ssh journalctl failed: time limit exceeded
FAIL, output in tmp/kola/qemu-2024-08-07-1441-17364
Error: harness: test suite failed
2024-08-07T14:51:31Z cli: harness: test suite failed

[coreos-assembler]$ ls overrides/rpm/
repodata                                selinux-policy-devel-41.13-1.fc41.noarch.rpm  selinux-policy-minimum-41.13-1.fc41.noarch.rpm  selinux-policy-sandbox-41.13-1.fc41.noarch.rpm
selinux-policy-41.13-1.fc41.noarch.rpm  selinux-policy-doc-41.13-1.fc41.noarch.rpm    selinux-policy-mls-41.13-1.fc41.noarch.rpm      selinux-policy-targeted-41.13-1.fc41.noarch.rpm

console log:
The AVC denials seen here all appear to be permissive=1:

Welcome to Fedora CoreOS 41.20240807.dev.0!

[    6.220428] systemd[1]: Initializing machine ID from VM UUID.
[    6.547894] systemd[1]: bpf-restrict-fs: LSM BPF program attached
[    6.658549] audit: type=1400 audit(1723041697.324:4): avc:  denied  { getattr } for  pid=1288 comm="coreos-boot-mou" path="/run/coreos/bootfs_uuid" dev="tmpfs" ino=1145 scontext=system_u:system_r:coreos_boot_mount_generator_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=1
[    6.663528] zram_generator::config[1323]: No configuration found.
[    6.669574] audit: type=1400 audit(1723041697.335:5): avc:  denied  { read } for  pid=1326 comm="cat" name="bootfs_uuid" dev="tmpfs" ino=1145 scontext=system_u:system_r:coreos_boot_mount_generator_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=1
[    6.673164] audit: type=1400 audit(1723041697.338:6): avc:  denied  { open } for  pid=1326 comm="cat" path="/run/coreos/bootfs_uuid" dev="tmpfs" ino=1145 scontext=system_u:system_r:coreos_boot_mount_generator_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=1
[    6.696011] Guest personality initialized and is inactive
[    6.697373] VMCI host device registered (name=vmci, major=10, minor=123)
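
One way to double-check that every denial in a console log is permissive is to tally the AVC lines by SELinux source domain and permissive flag. The following is a minimal sketch, assuming the console log is available as plain text on stdin; the function name summarize_avc and the file name console.txt are hypothetical and not part of kola or cosa.

```shell
# Tally AVC denials by source domain and permissive flag.
# Reads console log text on stdin and prints lines of the form
# "<count> <source_domain> permissive=<0|1>".
summarize_avc() {
    # Serial console captures often carry carriage returns; drop them first.
    tr -d '\r' |
        sed -n 's/.*avc: *denied .* scontext=[^:]*:[^:]*:\([^:]*\):.* permissive=\([01]\).*/\1 permissive=\2/p' |
        sort | uniq -c | sed 's/^ *//'
}

# Usage (hypothetical file name):
#   summarize_avc < console.txt
```

Any line in the result ending in permissive=0 would be a denial that was actually enforced, so those are the ones worth looking at first.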

Reproduction steps

  1. Clone the FCOS config repo and check out the rawhide branch. Use the rawhide branch to fetch and build the image.
  2. Drop the selinux-policy related pins from manifest-lock-overrides.yaml
  3. cosa init https://github.com/coreos/fedora-coreos-config
  4. koji download selinux-policy-41.13-1.fc41 to overrides/rpm
  5. cosa fetch && cosa build
  6. Run kola test basic.uefi

Expected behavior

  1. cosa run to come up

  2. kola test basic.uefi to PASS

Actual behavior

  1. cosa run hangs

  2. kola test basic.uefi fails:

[coreos-assembler]$ kola run basic.uefi
⏭  Skipping kola test pattern "fcos.internet":
  👉 https://github.com/coreos/coreos-assembler/pull/1478
⏭  Skipping kola test pattern "podman.workflow":
  👉 https://github.com/coreos/coreos-assembler/pull/1478
=== RUN   basic.uefi
--- FAIL: basic.uefi (611.40s)
        core.go:163: machine "8ab3f07b-a278-4ac7-9f28-48ae58a45dd5" failed to start: ssh journalctl failed: time limit exceeded
FAIL, output in tmp/kola/qemu-2024-08-07-1441-17364
Error: harness: test suite failed
2024-08-07T14:51:31Z cli: harness: test suite failed

System details

upstream rawhide x86_64 with selinux-policy-41.13-1.fc41

Butane or Ignition config

No response

Additional information

No response

@aaradhak aaradhak changed the title [rawhide] : kola test failure with selinux-policy-41.13-1.fc41 rpm [rawhide] kola test failure with selinux-policy-41.13-1.fc41 rpm Aug 13, 2024
@gursewak1997 gursewak1997 added the pipeline failure This issue or pull request is derived from CI failures label Aug 13, 2024
@marmijo
Member

marmijo commented Aug 13, 2024

I followed these steps with the latest SELinux policy, selinux-policy-41.14-1.fc41, which was just released. After running through the steps, I found that the test passes. Maybe this newer package resolved the issue?

Could you try selinux-policy-41.14-1.fc41 and see if the issue still happens on your end?

@jlebon
Member

jlebon commented Aug 13, 2024

The real test is to open a PR against rawhide that drops almost all the overrides (except the json-glib one) and fast-tracks that selinux-policy package and see what CI says. If it's green, we merge :)

@aaradhak
Member Author

aaradhak commented Aug 13, 2024

With all pins except json-glib dropped and selinux-policy-41.14-1.fc41 in use, the test continues to fail for the same reason.

@aaradhak
Member Author

aaradhak commented Aug 13, 2024

Opened this PR for testing: coreos/fedora-coreos-config#3090
The CI test failed:

[2024-08-14T02:40:50.382Z] --- FAIL: basic.uefi (609.90s)

[2024-08-14T02:40:50.382Z]         core.go:163: machine "57eae8b5-565b-45c4-9fa9-b79949c3ced3" failed to start: ssh journalctl failed: time limit exceeded

@marmijo
Member

marmijo commented Aug 13, 2024

I tried that locally too, and the issue from #1735 still persists.

Is there a reason we can't just remove only the selinux pin? I tested that locally: the build succeeds and the kola tests pass. I also opened a PR for testing: coreos/fedora-coreos-config#3091

@aaradhak
Member Author

aaradhak commented Aug 13, 2024

We'll need to remove the pins related to #1735, as selinux-policy was found to be the reason for the CI failure in PR-3010. That's the starting point of all this.

@aaradhak
Member Author

@jlebon Tested by adding enforcing=0 to image-base.yaml, along with the other change of removing all overrides except json-glib. The issue seems to persist.

aaradhak@fedora ~/coderepo/fedora-coreos-config (rawhide)$ git diff
diff --git a/image-base.yaml b/image-base.yaml
index cc038b04..be43f267 100644
--- a/image-base.yaml
+++ b/image-base.yaml
@@ -9,6 +9,7 @@ size: 10
 extra-kargs:
     # Disable SMT on systems vulnerable to MDS or any similar future issue.
     - mitigations=auto,nosmt
+    - enforcing=0

@jlebon
Member

jlebon commented Aug 14, 2024

@aaradhak Thanks for testing!

OK, so it seems we definitely have a separate, possibly systemd-related problem on our hands (because the only other package we were holding back apart from selinux-policy was systemd).

Let's update this issue to reflect that.

@jlebon jlebon changed the title [rawhide] kola test failure with selinux-policy-41.13-1.fc41 rpm [rawhide] Cannot SSH into node Aug 14, 2024
@jlebon
Member

jlebon commented Aug 14, 2024

So here's a test: run cosa run -c on the build, then add a key to the core user's authorized_keys. Then, from the cosa container where you started the VM, use ss -lntp to see which port QEMU is using to proxy SSH, and try to SSH to core@localhost:PORT with the same key you added. If it doesn't work, check the VM's journal for entries.
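
The port lookup in that test can be scripted. This is a minimal sketch, assuming the ss -lntp output format with the local address in the fourth column; the function name qemu_ssh_port and the key path are hypothetical.

```shell
# Print the local TCP port(s) a qemu process is listening on.
# Reads `ss -lntp` output on stdin; the local address:port is column 4.
qemu_ssh_port() {
    awk '/qemu/ { n = split($4, parts, ":"); print parts[n] }'
}

# Usage from the cosa container that started the VM (shown as comments
# because it needs a running VM; the key path is a placeholder):
#   port=$(ss -lntp | qemu_ssh_port | head -n 1)
#   ssh -p "$port" -i /path/to/private_key core@localhost
# If the login fails, inspect the VM side with:
#   journalctl -u sshd
```

Taking the last colon-separated piece of the address keeps the function working for bracketed IPv6 listeners such as [::1]:PORT as well.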

@aaradhak
Member Author

aaradhak commented Aug 14, 2024

@jlebon OK, I followed the steps you mentioned, and SSH to core@localhost is denied with a permission error.

cosa run -c

[core@cosa-devsh ~]$ cat ~/.ssh/id_rsa.pub 
ssh-rsa AAAAB....
...FBQ== core@localhost

[core@cosa-devsh ~]$ cat ~/.ssh/authorized_keys 
ssh-rsa AAAAB....
...FBQ== core@localhost

[core@cosa-devsh ~]$ chmod 700 ~/.ssh
[core@cosa-devsh ~]$ chmod 600 ~/.ssh/authorized_keys

QEMU SSH Port check:

[builder@b594f0ee9075 srv]$ ss -lntp | grep qemu
LISTEN 0      1          127.0.0.1:45235      0.0.0.0:*    users:(("qemu-system-x86",pid=125,fd=19))

[builder@b594f0ee9075 srv]$ ssh -p 45235 core@localhost
core@localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Journal log output:

[core@cosa-devsh ~]$ journalctl -u sshd
Aug 14 20:33:25 localhost.localdomain systemd[1]: Starting sshd.service - OpenSSH server daemon...
Aug 14 20:33:25 localhost.localdomain (sshd)[1902]: sshd.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
Aug 14 20:33:25 localhost.localdomain sshd[1902]: Server listening on 0.0.0.0 port 22.
Aug 14 20:33:25 localhost.localdomain sshd[1902]: Server listening on :: port 22.
Aug 14 20:33:25 localhost.localdomain systemd[1]: Started sshd.service - OpenSSH server daemon.
Aug 14 20:41:59 cosa-devsh sshd-session[2187]: Accepted publickey for core from ::1 port 37372 ssh2: RSA SHA256:des53qCrojkQ0tlspis1QhB6R5ely0iycOUUQKw/cNo
Aug 14 20:41:59 cosa-devsh sshd-session[2187]: pam_systemd(sshd:session): New sd-bus connection (system-bus-pam-systemd-2187) opened.
Aug 14 20:41:59 cosa-devsh sshd-session[2187]: pam_unix(sshd:session): session opened for user core(uid=1000) by core(uid=0)
Aug 14 20:41:59 cosa-devsh sshd-session[2187]: pam_systemd(sshd:session): New sd-bus connection (system-bus-pam-systemd-2187) opened.
Aug 14 20:41:59 cosa-devsh sshd-session[2187]: pam_unix(sshd:session): session closed for user core
Aug 14 20:50:11 cosa-devsh sshd-session[2283]: Connection closed by authenticating user core 10.0.2.2 port 48938 [preauth]
Aug 14 20:52:55 cosa-devsh sshd-session[2294]: Connection closed by authenticating user core 10.0.2.2 port 42950 [preauth]
Aug 14 20:52:57 cosa-devsh sshd-session[2296]: Connection closed by authenticating user core 10.0.2.2 port 42960 [preauth]
Aug 14 20:53:12 cosa-devsh systemd[1]: Stopping sshd.service - OpenSSH server daemon...
Aug 14 20:53:12 cosa-devsh sshd[1902]: Received signal 15; terminating.
Aug 14 20:53:12 cosa-devsh systemd[1]: sshd.service: Deactivated successfully.
Aug 14 20:53:12 cosa-devsh systemd[1]: Stopped sshd.service - OpenSSH server daemon.
-- Boot 1c8748f00a704951ae489d678e6e3108 --
Aug 14 20:53:22 localhost.localdomain systemd[1]: Starting sshd.service - OpenSSH server daemon...
Aug 14 20:53:22 localhost.localdomain (sshd)[1189]: sshd.service: Referenced but unset environment variable evaluates to an empty string: OPTIONS
Aug 14 20:53:22 localhost.localdomain sshd[1189]: Server listening on 0.0.0.0 port 22.
Aug 14 20:53:22 localhost.localdomain sshd[1189]: Server listening on :: port 22.
Aug 14 20:53:22 localhost.localdomain systemd[1]: Started sshd.service - OpenSSH server daemon.
Aug 14 20:54:42 cosa-devsh sshd-session[1403]: Connection closed by authenticating user core 10.0.2.2 port 52798 [preauth]
Aug 14 20:55:00 cosa-devsh sshd-session[1405]: Connection closed by authenticating user core 10.0.2.2 port 53234 [preauth]
Aug 14 20:56:35 cosa-devsh sshd-session[1414]: Accepted publickey for core from ::1 port 48664 ssh2: RSA SHA256:des53qCrojkQ0tlspis1QhB6R5ely0iycOUUQKw/cNo
Aug 14 20:56:35 cosa-devsh sshd-session[1414]: pam_systemd(sshd:session): New sd-bus connection (system-bus-pam-systemd-1414) opened.
Aug 14 20:56:36 cosa-devsh sshd-session[1414]: pam_unix(sshd:session): session opened for user core(uid=1000) by core(uid=0)
Aug 14 21:03:24 cosa-devsh sshd-session[1471]: Connection closed by authenticating user core 10.0.2.2 port 34654 [preauth]
Aug 14 21:05:22 cosa-devsh sshd-session[1487]: Connection closed by authenticating user core 10.0.2.2 port 59504 [preauth]
Aug 14 21:07:39 cosa-devsh sshd-session[1491]: Connection closed by authenticating user core 10.0.2.2 port 44794 [preauth]
Aug 14 21:10:41 cosa-devsh sshd-session[1517]: Connection closed by authenticating user core 10.0.2.2 port 46570 [preauth]

Looking at the entries marked [preauth], I assume the connection was attempted but failed during the authentication phase, likely because the public key was not accepted or not recognized.
I checked whether changing the access mode of authorized_keys would help, but SSH still doesn't go through.
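
To see the pattern in a journal like the one above at a glance, the sshd entries can be grouped by outcome and client address. A minimal sketch, assuming the message formats shown in this journal; the function name summarize_sshd_auth is hypothetical.

```shell
# Group sshd journal lines by outcome and client address.
# Reads `journalctl -u sshd` output on stdin and prints lines of the form
# "<count> <outcome> <client_address>".
summarize_sshd_auth() {
    awk '
        /Accepted publickey for/ {
            # "... Accepted publickey for NAME from ADDR port N ..."
            for (i = 1; i < NF; i++) if ($i == "from") print "accepted", $(i + 1)
        }
        /Connection closed by authenticating user/ {
            # "... authenticating user NAME ADDR port N [preauth]"
            for (i = 1; i < NF - 1; i++) if ($i == "user") print "preauth-closed", $(i + 2)
        }
    ' | sort | uniq -c | sed 's/^ *//'
}

# Usage inside the VM:
#   journalctl -u sshd | summarize_sshd_auth
```

Applied to the journal above, this separates the two logins accepted from ::1 from the nine connections from 10.0.2.2 that were closed before authentication completed.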

@jlebon jlebon self-assigned this Aug 21, 2024
@jlebon jlebon added the jira for syncing to jira label Aug 21, 2024
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Aug 21, 2024
systemd v256 added a new userdb functionality where SSH authorized
keys can be part of a User Record. To make this transparently
work with sshd authentication, an sshd config dropin that sets an
`AuthorizedKeysCommand` directive was added.

Unfortunately, it was added with a higher priority than intended,
which meant that it overrode the `AuthorizedKeysCommand` directive from
`ssh-key-dir`, which is how our `~/.ssh/authorized_keys.d/` magic works
today with Ignition and Afterburn. So the end result is that this broke
SSH which of course broke kola too.

This is tracked in upstream systemd at:

systemd/systemd#33648

The dropin was recently reverted in Fedora:

https://src.fedoraproject.org/rpms/systemd/c/38291e13c1dec15618b7d09e4217d10076897cdf?branch=rawhide

The latest rawhide systemd build with that change is already in the
repos:

https://bodhi.fedoraproject.org/updates/FEDORA-2024-ff872f0544

So we can just drop all the overrides to pull in the latest systemd.

We'll need to keep an eye on the conversation there to make sure that
the final solution doesn't re-break FCOS, but we would notice it pretty
quickly too.

Closes: coreos/fedora-coreos-tracker#1775
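
The override described in the commit message above follows from how sshd reads its configuration: for each keyword, the first value obtained is the one used, and drop-in files pulled in through a glob Include are read in lexical order. The sketch below imitates that rule for AuthorizedKeysCommand over a scratch directory. The function name is hypothetical, and the file names and command paths are illustrative assumptions, not copied from the systemd or ssh-key-dir packages.

```shell
# Print the first AuthorizedKeysCommand found across the *.conf files of a
# drop-in directory, visiting files in sorted glob order.
effective_authorized_keys_command() {
    dir=$1
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        # Keywords are case-insensitive; commented-out lines do not match
        # because their first field starts with "#".
        line=$(awk 'tolower($1) == "authorizedkeyscommand" { print; exit }' "$f")
        if [ -n "$line" ]; then
            printf '%s: %s\n' "${f##*/}" "$line"
            return 0
        fi
    done
    return 1
}

# Demo with two illustrative drop-ins (names and paths are assumptions).
demo=$(mktemp -d)
printf 'AuthorizedKeysCommand /usr/bin/userdbctl ssh-authorized-keys %%u\n' > "$demo/20-systemd-userdb.conf"
printf 'AuthorizedKeysCommand /usr/libexec/ssh-key-dir %%u\n' > "$demo/40-ssh-key-dir.conf"
effective_authorized_keys_command "$demo"
# prints: 20-systemd-userdb.conf: AuthorizedKeysCommand /usr/bin/userdbctl ssh-authorized-keys %u
rm -rf "$demo"
```

On a real host, running sshd -T as root and filtering for authorizedkeyscommand should show which value sshd actually settled on.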
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Aug 21, 2024
systemd v256 added a new userdb functionality where SSH authorized
keys can be part of a User Record. To make this transparently
work with sshd authentication, an sshd config dropin that sets an
`AuthorizedKeysCommand` directive was added.

Unfortunately, it was added with a higher priority than intended,
which meant that it overrode the `AuthorizedKeysCommand` directive from
`ssh-key-dir`, which is how our `~/.ssh/authorized_keys.d/` magic works
today with Ignition and Afterburn. So the end result is that this broke
SSH which of course broke kola too.

This is tracked in upstream systemd at:

systemd/systemd#33648

The dropin was recently reverted in Fedora:

https://src.fedoraproject.org/rpms/systemd/c/38291e13c1dec15618b7d09e4217d10076897cdf?branch=f41

The latest f41 systemd build with that change is already in the
repos:

https://bodhi.fedoraproject.org/updates/FEDORA-2024-8d144cc8af

So we can just drop all the overrides to pull in the latest systemd.

We'll need to keep an eye on the conversation there to make sure that
the final solution doesn't re-break FCOS, but we would notice it pretty
quickly too.

Closes: coreos/fedora-coreos-tracker#1775
@jlebon
Member

jlebon commented Aug 21, 2024

Should be fixed by coreos/fedora-coreos-config#3115 and coreos/fedora-coreos-config#3116.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Aug 22, 2024
systemd v256 added a new userdb functionality where SSH authorized
keys can be part of a User Record. To make this transparently
work with sshd authentication, an sshd config dropin that sets an
`AuthorizedKeysCommand` directive was added.

Unfortunately, it was added with a higher priority than intended,
which meant that it overrode the `AuthorizedKeysCommand` directive from
`ssh-key-dir`, which is how our `~/.ssh/authorized_keys.d/` magic works
today with Ignition and Afterburn. So the end result is that this broke
SSH which of course broke kola too.

This is tracked in upstream systemd at:

systemd/systemd#33648

The dropin was recently reverted in Fedora:

https://src.fedoraproject.org/rpms/systemd/c/38291e13c1dec15618b7d09e4217d10076897cdf?branch=rawhide

Fast-track the latest rawhide systemd build with that change.

We'll need to keep an eye on the conversation there to make sure that
the final solution doesn't re-break FCOS, but we would notice it pretty
quickly too.

Closes: coreos/fedora-coreos-tracker#1775
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Aug 22, 2024
systemd v256 added a new userdb functionality where SSH authorized
keys can be part of a User Record. To make this transparently
work with sshd authentication, an sshd config dropin that sets an
`AuthorizedKeysCommand` directive was added.

Unfortunately, it was added with a higher priority than intended,
which meant that it overrode the `AuthorizedKeysCommand` directive from
`ssh-key-dir`, which is how our `~/.ssh/authorized_keys.d/` magic works
today with Ignition and Afterburn. So the end result is that this broke
SSH which of course broke kola too.

This is tracked in upstream systemd at:

systemd/systemd#33648

The dropin was recently reverted in Fedora:

https://src.fedoraproject.org/rpms/systemd/c/38291e13c1dec15618b7d09e4217d10076897cdf?branch=f41

Fast-track the latest f41 systemd build with that change.

We'll need to keep an eye on the conversation there to make sure that
the final solution doesn't re-break FCOS, but we would notice it pretty
quickly too.

Closes: coreos/fedora-coreos-tracker#1775
jlebon added a commit to coreos/fedora-coreos-config that referenced this issue Aug 22, 2024
jlebon added a commit to coreos/fedora-coreos-config that referenced this issue Aug 26, 2024
@jbtrystram
Contributor

jbtrystram commented Aug 27, 2024

I am still seeing these errors on the s390x build of branched: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build-arch/detail/build-arch/106/pipeline/

Relevant packages:
selinux-policy 40.13-1.fc40 -> 41.14-1.fc41
selinux-policy-targeted 40.13-1.fc40 -> 41.14-1.fc41
systemd 255.3-1.fc40 -> 256.5-1.fc41

Most tests fail with:

[    8.131680] (sd-exec-[1340]: /usr/lib/systemd/system-generators/systemd-ssh-generator failed with exit status 1.

Relevant SELinux denials:

[    8.111974] audit: type=1400 audit(1724677759.011:4): avc:  denied  { read } for  pid=1358 comm="systemd-ssh-gen" name="sysinfo" dev="proc" ino=4026531945 scontext=system_u:system_r:systemd_ssh_generator_t:s0 tcontext=system_u:object_r:sysctl_t:s0 tclass=file permissive=0
[    8.112130] audit: type=1400 audit(1724677759.011:5): avc:  denied  { getattr } for  pid=1341 comm="coreos-boot-mou" path="/run/coreos/bootfs_uuid" dev="tmpfs" ino=781 scontext=system_u:system_r:coreos_boot_mount_generator_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=1
[    8.114390] audit: type=1400 audit(1724677759.011:6): avc:  denied  { read } for  pid=1378 comm="cat" name="bootfs_uuid" dev="tmpfs" ino=781 scontext=system_u:system_r:coreos_boot_mount_generator_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=1
[    8.114440] audit: type=1400 audit(1724677759.011:7): avc:  denied  { open } for  pid=1378 comm="cat" path="/run/coreos/bootfs_uuid" dev="tmpfs" ino=781 scontext=system_u:system_r:coreos_boot_mount_generator_t:s0 tcontext=system_u:object_r:var_run_t:s0 tclass=file permissive=1
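
When a log mixes permissive and enforced denials like this one, it can help to filter down to the enforced ones (permissive=0), since only those are actually blocked. A minimal sketch, assuming the kernel audit line format shown above; the function name enforced_denials is hypothetical.

```shell
# List only enforced AVC denials (permissive=0).
# Reads log text on stdin and prints "<comm> <tclass> <permissions>" per denial.
enforced_denials() {
    grep 'avc: *denied' | grep 'permissive=0' |
        sed -n 's/.*denied *{ \([^}]*\) }.* comm="\([^"]*\)".* tclass=\([^ ]*\).*/\2 \3 \1/p'
}

# Usage (hypothetical file name):
#   enforced_denials < console.txt
```

For the four denials quoted above, only the systemd-ssh-gen read on a file would be listed; the coreos_boot_mount_generator_t entries are permissive=1 and are filtered out.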

@jlebon
Member

jlebon commented Aug 27, 2024

@jbtrystram That looks like a different bug. Can you make a separate issue for this so we don't get confused? Let's close this one.

@jlebon jlebon closed this as completed Aug 27, 2024
5 participants