Common infrastructure for auto loading rdma modules
This is inspired by the similar approach in the redhat directory but
takes a more general approach relying on udev and systemd to do the
actual work fully dynamically instead of a oneshot shell script.

Loading is split into two cases
 1) Loading RDMA support modules when RDMA capable hardware is installed.
    This is only needed for ethernet devices which do not load their RDMA
    support modules via request_module in the kernel.

    udev is used to detect when an ethernet device controlled by a specific
    module is hot plugged and then udev directly loads the RDMA module

 2) Loading RDMA ULP support when RDMA hardware is installed
    This is done by having udev detect when RDMA hardware is installed and
    udev causes systemd to load a list of modules from config files in
    /etc/rdma/modules/

    The user can customize these files to select which ULP modules should be
    loaded.

This broadly replaces the redhat/rdma.conf scheme.

In all cases users can prevent a module from being auto-loaded on their
system by blacklisting it in a file in /etc/modprobe.d/

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
jgunthorpe committed Aug 3, 2017
1 parent 3bc3cee commit 2f4fb9f
Showing 15 changed files with 285 additions and 1 deletion.
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -402,6 +402,7 @@ configure_file("${BUILDLIB}/config.h.in" "${BUILD_INCLUDE}/config.h" ESCAPE_QUOT
add_subdirectory(ccan)
add_subdirectory(util)
add_subdirectory(Documentation)
add_subdirectory(kernel-boot)
# Libraries
add_subdirectory(libibumad)
add_subdirectory(libibumad/man)
84 changes: 84 additions & 0 deletions Documentation/udev.md
@@ -0,0 +1,84 @@
# Kernel Module Loading

The RDMA subsystem relies on the kernel, udev and systemd to load modules on
demand when RDMA hardware is present. The RDMA subsystem is unusual in that it
does not load the optional RDMA hardware modules unless the system has
the rdma-core package installed.

This avoids enabling RDMA on systems that do not use it, for instance a
system that has a multi-protocol ethernet adapter but is only using
the net stack interface.

## Boot ordering with systemd

systemd assumes everything is hot pluggable and runs in an event driven
manner. This creates a chain of hot plug events as each part of the system
autoloads based on earlier parts. The first step in the process is udev
loading the physical hardware driver.

This can happen in several spots along the bootup:

- From the initrd or built into the kernel. If hardware modules are present
  in the initrd then they are loaded into the kernel before booting the
  system. This is done largely synchronously with the boot process.

- From udev when it auto detects PCI hardware or otherwise.
  This happens asynchronously in the boot process; systemd does not wait for
  udev to finish loading modules before it continues on.

  This path makes it very likely the system will experience a RDMA 'hot plug'
  scenario.

- From systemd's fixed module loader systemd-modules-load.service, e.g. from
  the list in /etc/modules-load.d/. In this case the module load happens
  synchronously within systemd and it will hold off sysinit.target until
  modules are loaded.

Once the hardware module is loaded it may be necessary to load a protocol
module, e.g. to enable RDMA support on an ethernet device.

This is triggered automatically by udev rules that match the master devices
and load the protocol module with udev's module loader. This happens
asynchronously to the rest of the systemd startup.
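
As the commit message notes, a module can be kept from auto-loading by
blacklisting it in /etc/modprobe.d/. A minimal sketch, assuming an i40e
adapter whose RDMA component (i40iw) should stay unloaded; the file name is
hypothetical, and this relies on udev's module loader honouring modprobe
blacklists:

```
# /etc/modprobe.d/no-rdma-autoload.conf (hypothetical file name)
# Keep the udev rule for i40e from loading its RDMA protocol module.
blacklist i40iw
```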

Once a RDMA device is created by the kernel then udev will cause systemd to
schedule ULP module loading services (e.g. rdma-load-modules@.service) specific
to the plugged hardware. If sysinit.target has not yet been passed then these
loaders will defer sysinit.target until they complete, otherwise this is a hot
plug event and things will load asynchronously to the boot up process.
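
The instance name of rdma-load-modules@.service selects the file that is read,
so rdma-load-modules@infiniband.service loads the modules listed in
/etc/rdma/modules/infiniband.conf. As a sketch of local customization, an admin
who wants SRP target support could uncomment ib_srpt, which ships commented
out (the comment wording below is illustrative):

```
# /etc/rdma/modules/infiniband.conf, edited locally
# One module name per line; lines starting with '#' are ignored.
ib_ipoib
ib_umad

# SCSI Remote Protocol target support, enabled on this system
ib_srpt
```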

Finally udev will cause systemd to start RDMA specific daemons like
srp_daemon, rdma-ndd and iwpmd. These starts are linked to the detection of
the first RDMA hardware, and the daemons internally handle hot plug events for
other hardware.

## Hot Plug compatible services

Services using RDMA need to have device specific systemd dependencies in their
unit files, either created by hand by the admin or by using udev rules.

For instance, a service that uses /dev/infiniband/umad0 requires:

```
After=dev-infiniband-umad0.device
BindsTo=dev-infiniband-umad0.device
```

Which will ensure the service will not run until the required umad device
appears, and will be stopped if the umad device is unplugged.

This is similar to how systemd handles mounting filesystems and configuring
ethernet devices.
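
A minimal sketch of a complete unit using these dependencies follows. The
service name and binary path are hypothetical, and it assumes udev tags the
umad device with "systemd" so that the .device unit exists. How the service
gets started (by hand, or from a udev rule) is left out.

```
# /etc/systemd/system/umad-consumer.service (hypothetical name)
[Unit]
Description=Example service that uses /dev/infiniband/umad0
# Do not start before the device appears, and stop if it is unplugged.
After=dev-infiniband-umad0.device
BindsTo=dev-infiniband-umad0.device

[Service]
# Hypothetical binary that opens /dev/infiniband/umad0
ExecStart=/usr/local/bin/umad-consumer
```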

## Interaction with legacy non-hotplug services

Services that cannot handle hot plug must be ordered after
systemd-udev-settle.service, which will wait for udev to complete loading
modules and scheduling systemd services. This ensures that all RDMA hardware
present at boot is set up before proceeding to run the legacy service.

Admins using legacy services can also place their RDMA hardware modules
(e.g. mlx4_ib) directly in /etc/modules-load.d/ or in their initrd, which will
cause systemd to defer passing sysinit.target until all RDMA hardware is
set up. This is usually sufficient for legacy services, and is probably the
default behavior in many configurations.
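
Both options can be sketched as small config fragments. The service name and
file names below are hypothetical. Note that After= only orders the units;
Wants= is what pulls systemd-udev-settle.service into the boot transaction.

```
# /etc/systemd/system/legacy-rdma-app.service.d/wait-for-udev.conf
# (hypothetical drop-in for a hypothetical legacy service)
[Unit]
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
```

```
# /etc/modules-load.d/rdma-hw.conf (hypothetical file name)
# Loaded synchronously by systemd-modules-load.service before sysinit.target.
mlx4_ib
```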
9 changes: 9 additions & 0 deletions debian/rdma-core.install
@@ -1,7 +1,16 @@
etc/modprobe.d/mlx4.conf
etc/modprobe.d/truescale.conf
etc/rdma/modules/infiniband.conf
etc/rdma/modules/iwarp.conf
etc/rdma/modules/opa.conf
etc/rdma/modules/rdma.conf
etc/rdma/modules/roce.conf
lib/systemd/system/rdma-load-modules@.service
lib/systemd/system/rdma-ndd.service
lib/udev/rules.d/60-rdma-ndd.rules
lib/udev/rules.d/75-rdma-description.rules
lib/udev/rules.d/90-rdma-hw-modules.rules
lib/udev/rules.d/90-rdma-ulp-modules.rules
usr/bin/rxe_cfg
usr/lib/truescale-serdes.cmds
usr/sbin/rdma-ndd
24 changes: 24 additions & 0 deletions kernel-boot/CMakeLists.txt
@@ -0,0 +1,24 @@
rdma_subst_install(FILES rdma-load-modules@.service.in
DESTINATION "${CMAKE_INSTALL_SYSTEMD_SERVICEDIR}"
RENAME rdma-load-modules@.service
PERMISSIONS OWNER_WRITE OWNER_READ GROUP_READ WORLD_READ)

install(FILES
modules/infiniband.conf
modules/iwarp.conf
modules/opa.conf
modules/rdma.conf
modules/roce.conf
DESTINATION "${CMAKE_INSTALL_SYSCONFDIR}/rdma/modules")

install(FILES "rdma-description.rules"
RENAME "75-rdma-description.rules"
DESTINATION "${CMAKE_INSTALL_UDEV_RULESDIR}")

install(FILES "rdma-hw-modules.rules"
RENAME "90-rdma-hw-modules.rules"
DESTINATION "${CMAKE_INSTALL_UDEV_RULESDIR}")

install(FILES "rdma-ulp-modules.rules"
RENAME "90-rdma-ulp-modules.rules"
DESTINATION "${CMAKE_INSTALL_UDEV_RULESDIR}")
12 changes: 12 additions & 0 deletions kernel-boot/modules/infiniband.conf
@@ -0,0 +1,12 @@
# These modules are loaded by the system if any InfiniBand device is installed
# InfiniBand over IP netdevice
ib_ipoib

# Access to fabric management SMPs and GMPs from userspace.
ib_umad

# SCSI Remote Protocol target support
# ib_srpt

# ib_ucm provides the obsolete /dev/infiniband/ucm0
# ib_ucm
2 changes: 2 additions & 0 deletions kernel-boot/modules/iwarp.conf
@@ -0,0 +1,2 @@
# These modules are loaded by the system if any iWarp device is installed
iw_cm
10 changes: 10 additions & 0 deletions kernel-boot/modules/opa.conf
@@ -0,0 +1,10 @@
# These modules are loaded by the system if any OmniPath Architecture device
# is installed
# Infiniband over IP netdevice
ib_ipoib

# Access to fabric management SMPs and GMPs from userspace.
ib_umad

# Omnipath Ethernet Virtual NIC netdevice
opa_vnic
21 changes: 21 additions & 0 deletions kernel-boot/modules/rdma.conf
@@ -0,0 +1,21 @@
# These modules are loaded by the system if any RDMA device is installed
# iSCSI over RDMA client support
ib_iser

# iSCSI over RDMA target support
# ib_isert

# User access to RDMA verbs (supports libibverbs)
ib_uverbs

# User access to RDMA connection management (supports librdmacm)
rdma_ucm

# RDS over RDMA support
# rds_rdma

# NFS over RDMA client support
xprtrdma

# NFS over RDMA server support
svcrdma
2 changes: 2 additions & 0 deletions kernel-boot/modules/roce.conf
@@ -0,0 +1,2 @@
# These modules are loaded by the system if any RDMA over Converged Ethernet
# device is installed
43 changes: 43 additions & 0 deletions kernel-boot/rdma-description.rules
@@ -0,0 +1,43 @@
# This is a version of net-description.rules for /sys/class/infiniband devices

ACTION=="remove", GOTO="rdma_description_end"
SUBSYSTEM!="infiniband", GOTO="rdma_description_end"

# NOTE: DRIVERS searches up the sysfs path to find the driver that is bound to
# the PCI/etc device that the RDMA device is linked to. This is not the kernel
# driver that is supplying the RDMA device (eg as seen in ID_NET_DRIVER)

# FIXME: with kernel support we could actually detect the protocols the RDMA
# driver itself supports, this is a work around for lack of that support.
# In future we could do this with a udev IMPORT{program} helper program
# that extracted the ID information from the RDMA netlink.

# Hardware that supports InfiniBand
DRIVERS=="mlx4_core", ENV{ID_RDMA_INFINIBAND}="1"
DRIVERS=="mlx5_core", ENV{ID_RDMA_INFINIBAND}="1"
DRIVERS=="qib", ENV{ID_RDMA_INFINIBAND}="1"

# Hardware that supports OPA
DRIVERS=="hfi1", ENV{ID_RDMA_OPA}="1"

# Hardware that supports iWarp
DRIVERS=="cxgb3", ENV{ID_RDMA_IWARP}="1"
DRIVERS=="cxgb4", ENV{ID_RDMA_IWARP}="1"
DRIVERS=="i40e", ENV{ID_RDMA_IWARP}="1"
DRIVERS=="nes", ENV{ID_RDMA_IWARP}="1"

# Hardware that supports RoCE
DRIVERS=="be2net", ENV{ID_RDMA_ROCE}="1"
DRIVERS=="bnxt_en", ENV{ID_RDMA_ROCE}="1"
DRIVERS=="hns", ENV{ID_RDMA_ROCE}="1"
DRIVERS=="mlx4_core", ENV{ID_RDMA_ROCE}="1"
DRIVERS=="mlx5_core", ENV{ID_RDMA_ROCE}="1"
DRIVERS=="qede", ENV{ID_RDMA_ROCE}="1"
DEVPATH=="*/infiniband/rxe*", ATTR{parent}=="*", ENV{ID_RDMA_ROCE}="1"

# Setup the usual ID information so that systemd will display a sane name for
# the RDMA device units.
SUBSYSTEMS=="pci", ENV{ID_BUS}="pci", ENV{ID_VENDOR_ID}="$attr{vendor}", ENV{ID_MODEL_ID}="$attr{device}"
SUBSYSTEMS=="pci", IMPORT{builtin}="hwdb --subsystem=pci"

LABEL="rdma_description_end"
39 changes: 39 additions & 0 deletions kernel-boot/rdma-hw-modules.rules
@@ -0,0 +1,39 @@
ACTION=="remove", GOTO="rdma_hw_modules_end"
SUBSYSTEM!="net", GOTO="rdma_hw_modules_end"

# Automatically load RDMA specific kernel modules when a multi-function device is installed

# These drivers autoload an ethernet driver based on hardware detection and
# need userspace to load the module that has their RDMA component to turn on
# RDMA.
ENV{ID_NET_DRIVER}=="be2net", RUN{builtin}+="kmod load ocrdma"
ENV{ID_NET_DRIVER}=="bnxt_en", RUN{builtin}+="kmod load bnxt_re"
ENV{ID_NET_DRIVER}=="cxgb3", RUN{builtin}+="kmod load iw_cxgb3"
ENV{ID_NET_DRIVER}=="cxgb4", RUN{builtin}+="kmod load iw_cxgb4"
ENV{ID_NET_DRIVER}=="hns", RUN{builtin}+="kmod load hns_roce"
ENV{ID_NET_DRIVER}=="i40e", RUN{builtin}+="kmod load i40iw"
ENV{ID_NET_DRIVER}=="mlx4_en", RUN{builtin}+="kmod load mlx4_ib"
ENV{ID_NET_DRIVER}=="mlx5_core", RUN{builtin}+="kmod load mlx5_ib"
ENV{ID_NET_DRIVER}=="qede", RUN{builtin}+="kmod load qedr"

# The user must explicitly load these modules via /etc/modules-load.d/ or otherwise
# rxe

# When in IB mode the kernel PCI core module autoloads the protocol modules
# for these providers
# mlx4
# mlx5

# enic no longer has a userspace verbs driver, this rule should probably be
# owned by libfabric
ENV{ID_NET_DRIVER}=="enic", RUN{builtin}+="kmod load usnic_verbs"

# These providers are single function and autoload RDMA automatically based on
# PCI probing
# hfi1verbs
# ipathverbs
# mthca
# vmw_pvrdma
# nes

LABEL="rdma_hw_modules_end"
16 changes: 16 additions & 0 deletions kernel-boot/rdma-load-modules@.service.in
@@ -0,0 +1,16 @@
[Unit]
Description=Load RDMA modules from @CMAKE_INSTALL_FULL_SYSCONFDIR@/rdma/modules/%I.conf
Documentation=file:@CMAKE_INSTALL_FULL_DOCDIR@/udev.md
DefaultDependencies=no
Conflicts=shutdown.target
# network-pre.target is to support distro network setup scripts that run after
# systemd-modules-load.service but before sysinit.target, eg a classic network
# setup script.
Before=sysinit.target shutdown.target network-pre.target
ConditionCapability=CAP_SYS_MODULE

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/lib/systemd/systemd-modules-load @CMAKE_INSTALL_FULL_SYSCONFDIR@/rdma/modules/%I.conf
TimeoutSec=90s
11 changes: 11 additions & 0 deletions kernel-boot/rdma-ulp-modules.rules
@@ -0,0 +1,11 @@
ACTION=="remove", GOTO="rdma_ulp_modules_end"
SUBSYSTEM!="infiniband", GOTO="rdma_ulp_modules_end"

# Automatically load general RDMA ULP modules when RDMA hardware is installed
TAG+="systemd", ENV{SYSTEMD_WANTS}+="rdma-load-modules@rdma.service"
TAG+="systemd", ENV{ID_RDMA_INFINIBAND}=="1", ENV{SYSTEMD_WANTS}+="rdma-load-modules@infiniband.service"
TAG+="systemd", ENV{ID_RDMA_IWARP}=="1", ENV{SYSTEMD_WANTS}+="rdma-load-modules@iwarp.service"
TAG+="systemd", ENV{ID_RDMA_OPA}=="1", ENV{SYSTEMD_WANTS}+="rdma-load-modules@opa.service"
TAG+="systemd", ENV{ID_RDMA_ROCE}=="1", ENV{SYSTEMD_WANTS}+="rdma-load-modules@roce.service"

LABEL="rdma_ulp_modules_end"
1 change: 1 addition & 0 deletions rdma-core.spec
@@ -142,4 +142,5 @@ rm -rf %{buildroot}/%{my_unitdir}/
%config %{_sysconfdir}/iwpmd.conf
%config %{_sysconfdir}/srp_daemon.conf
%config %{_sysconfdir}/libibverbs.d/*
%config %{_sysconfdir}/rdma/modules/*
%{_sysconfdir}/modprobe.d/*
11 changes: 10 additions & 1 deletion redhat/rdma-core.spec
@@ -322,17 +322,26 @@ rm -rf %{buildroot}/%{_sbindir}/srp_daemon.sh
%doc %{_docdir}/%{name}-%{version}/README.md
%doc %{_docdir}/%{name}-%{version}/rxe.md
%config(noreplace) %{_sysconfdir}/rdma/mlx4.conf
%config(noreplace) %{_sysconfdir}/rdma/modules/infiniband.conf
%config(noreplace) %{_sysconfdir}/rdma/modules/iwarp.conf
%config(noreplace) %{_sysconfdir}/rdma/modules/opa.conf
%config(noreplace) %{_sysconfdir}/rdma/modules/rdma.conf
%config(noreplace) %{_sysconfdir}/rdma/modules/roce.conf
%config(noreplace) %{_sysconfdir}/rdma/rdma.conf
%config(noreplace) %{_sysconfdir}/rdma/sriov-vfs
%config(noreplace) %{_sysconfdir}/udev/rules.d/*
%config(noreplace) %{_sysconfdir}/modprobe.d/mlx4.conf
%config(noreplace) %{_sysconfdir}/modprobe.d/truescale.conf
%{_sysconfdir}/sysconfig/network-scripts/*
%{_unitdir}/rdma-load-modules@.service
%{_unitdir}/rdma.service
%dir %{dracutlibdir}/modules.d/05rdma
%{dracutlibdir}/modules.d/05rdma/module-setup.sh
%{_udevrulesdir}/98-rdma.rules
%{_udevrulesdir}/60-rdma-ndd.rules
%{_udevrulesdir}/75-rdma-description.rules
%{_udevrulesdir}/90-rdma-hw-modules.rules
%{_udevrulesdir}/90-rdma-ulp-modules.rules
%{_udevrulesdir}/98-rdma.rules
%{sysmodprobedir}/libmlx4.conf
%{sysmodprobedir}/cxgb3.conf
%{sysmodprobedir}/cxgb4.conf