Ensure RDMA service loads modules in initrd #1481

Merged 3 commits on Jul 18, 2024

Contributor

@cjubran cjubran commented Jul 2, 2024

Fixes an issue where the RDMA service was killed during the switch to the root filesystem. Adds a wants symlink for initrd.target in the dracut setup and Before=initrd.target to the systemd service, so that the service runs to completion in the initrd.
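
The two pieces described above can be sketched in a scratch directory. This is an illustration only, not the patch itself: the directory layout stands in for dracut's `$initdir`, the unit file shows just the ordering line relevant here, and the instance names `infiniband` and `rdma` are assumed from the `.conf` names in the kernel log below.

```shell
# Scratch directory standing in for dracut's $initdir (hypothetical layout).
initdir="$(mktemp -d)"
unitdir="$initdir/usr/lib/systemd/system"
mkdir -p "$unitdir/initrd.target.wants"

# 1. Ordering: the unit has to finish before initrd.target is reached.
#    Only the line relevant to this PR is shown; the real unit has more.
cat > "$unitdir/rdma-load-modules@.service" <<'EOF'
[Unit]
Before=initrd.target
EOF

# 2. Wants symlinks: pull instances of the template unit into initrd.target.
#    Instance names are an assumption taken from the kernel log.
for instance in infiniband rdma; do
    ln -s "../rdma-load-modules@.service" \
        "$unitdir/initrd.target.wants/rdma-load-modules@$instance.service"
done

# Show the resulting wants directory.
ls -l "$unitdir/initrd.target.wants"
```

With both in place, systemd starts the instances as part of initrd.target and orders them before it, so they should already be complete by the time initrd-cleanup.service isolates to initrd-switch-root.target.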

Ensure that the rdma-load-modules@.service is started as part of the
initrd.

Fixes: 39fa824 ("redhat: add udev/systemd/etc infrastructure bits")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Ensure that the rdma-load-modules@.service is started as part of the
initrd.

Fixes: 7752410 ("suse: fix dracut support")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
rdma-load-modules@.service runs in the initrd. However, it gets
terminated when initrd-cleanup.service isolates to
initrd-switch-root.target. The termination can occur in the middle of
the IPoIB initialization, leading to a failure to load netdevices.

Include 'Before=initrd.target' to ensure that the services are not
being killed when initrd-cleanup.service isolates to
initrd-switch-root.target.

Kernel log:

workqueue: Failed to create a rescuer kthread for wq "ipoib_wq": -EINTR
Cleaning Up and Shutting Down Daemons
ib0: failed to allocate device WQ
mlx5_0: failed to initialize device: ib0 port 1 (ret = -12)
mlx5_0: couldn't register ipoib port 1; error -12
workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR
ibp6s0f1, 1: ipoib_intf_alloc failed -12
workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR
ibp6s0f2, 1: ipoib_intf_alloc failed -12
workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR
ibp6s0f3, 1: ipoib_intf_alloc failed -12
Stopped Load RDMA modules …/rdma/modules/infiniband.conf
Stopped Load RDMA modules …m /etc/rdma/modules/rdma.conf

Fixes: 2f4fb9f ("Common infrastructure for auto loading rdma modules")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
@nmorey
Contributor

nmorey commented Jul 2, 2024

Can I ask why you are running this service in the initrd?
Shouldn't any service/target/FS that needs RDMA be the one to have Before=initrd.target?

@cjubran
Contributor Author

cjubran commented Jul 2, 2024

> Can I ask why you are running this service in the initrd? Shouldn't any service/target/FS that needs RDMA be the one to have Before=initrd.target?

We run this service in initrd to load RDMA modules early in the boot process. It prevents issues like failed NFS mounts over RDMA due to missing modules.

@jgunthorpe
Member

I talked to Chuck and he told me that NFS module autoloading works fine, even checked it out for me.

So why is it failing in the initrd?

@cjubran
Contributor Author

cjubran commented Jul 10, 2024

> I talked to Chuck and he told me that NFS module autoloading works fine, even checked it out for me.
>
> So why is it failing in the initrd?

The rdma service is terminated when initrd-cleanup.service isolates for initrd-switch-root.target because initrd.target lacks the dependency on the rdma service.

@rleon rleon merged commit f6ff7a3 into linux-rdma:master Jul 18, 2024
14 checks passed
@nmorey
Contributor

nmorey commented Aug 4, 2024

Does this actually work out of the box? I feel like the SUSE/RH spec files (at least) are missing a call to dracut somewhere?

@rleon
Member

rleon commented Aug 7, 2024

> Does this actually work out of the box? I feel like the SUSE/RH spec files (at least) are missing a call to dracut somewhere?

I don't think so. We saw this failure in our regression runs, where we run dracut anyway.
Carolina asked me whether we need to add a call to dracut in the spec file too, but I was afraid that people would be unhappy if installing rdma-core required a reboot after this spec file change.
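
Whether an existing initrd already carries the unit can be checked by hand. A minimal sketch, assuming a dracut-based system where `lsinitrd` prints the file listing of the current initramfs; the helper name `initrd_lists_rdma_unit` is hypothetical and not part of rdma-core or dracut:

```shell
# Reads an initramfs file listing on stdin (for example the output of
# `lsinitrd`) and succeeds if it mentions the RDMA module-loading unit.
initrd_lists_rdma_unit() {
    grep -q -- 'rdma-load-modules@'
}

# Typical use on a dracut-based system (needs root; not run here):
#   lsinitrd | initrd_lists_rdma_unit || dracut --force
```

If the unit is missing, regenerating the initramfs with `dracut --force` is what picks up the updated dracut module. The new image only takes effect at the next boot, which is the reboot concern raised above.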
