LLNL/msr-safe

NAME

msr_safe - kernel module implementing access-control lists for model-specific registers

SYNOPSIS

/dev/cpu/<cpuid>/msr_safe
/dev/cpu/msr_batch
/dev/cpu/msr_allowlist
/dev/cpu/msr_version
msr-save

OVERVIEW

msr_safe provides controlled userspace access to model-specific registers (MSRs). It allows system administrators to give register-level read access and bit-level write access to trusted users in production environments. This access is useful where kernel drivers have not caught up with new processor features, or where performance constraints require batch access across dozens or hundreds of registers.

SETUP

Building the kernel module requires the Linux kernel headers. Best practice for production environments is to create msr-user and msr-admin groups. Members of the former can read and write MSRs using either the per-CPU interface or the batch interface, subject to the restrictions specified in the allowlist. Members of the latter can also change the contents of the allowlist.

git clone https://github.com/LLNL/msr-safe
cd msr-safe
make
sudo insmod ./msr-safe.ko
sudo chmod g+rw /dev/cpu/*/msr_safe /dev/cpu/msr_*
sudo chgrp msr-user /dev/cpu/*/msr_safe /dev/cpu/msr_batch /dev/cpu/msr_version
sudo chgrp msr-admin /dev/cpu/msr_allowlist

msr_safe uses dynamically allocated major device numbers. These can conflict with devices that use hard-coded numbers. To work around this, major device numbers can be specified during module load.

sudo insmod msr-safe.ko \
                [ mdev_msr_safe=<#> ] \
                [ mdev_msr_allowlist=<#> ] \
                [ mdev_msr_batch=<#> ] \
                [ mdev_msr_version=<#> ] 

Use rmmod(8) to unload msr-safe.

sudo rmmod msr-safe

DESCRIPTION

/dev/cpu/msr_allowlist

Contains a list of model-specific registers and their write masks. Supports read(2), write(2), and open(2). Any MSR access using msr_safe or msr_batch is checked against this list. An MSR can be read if its address is present in the list. An MSR can be written only if its address is present in the list and the write mask marks at least one bit as writable. For example, the following entry marks the MSR at address 0x10 (the time stamp counter) as read-only, because its write mask is 0.

0x00000010 0x0000000000000000 # "MSR_TIME_STAMP_COUNTER"

This entry allows MSR_PERF_CTL (at address 0x199) to be read, but only the bottom sixteen bits are writeable.

0x00000199 0x000000000000ffff # "MSR_PERF_CTL"
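
As a rough illustration of how the two entries above are interpreted, the following sketch parses an allowlist line and reports whether the MSR is writable. It is not taken from the msr-safe sources: struct allowlist_entry, parse_allowlist_entry, and entry_is_writable are hypothetical names, and the parser assumes the "address writemask" layout shown here, ignoring anything after the mask.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace view of one allowlist line: "<address> <writemask>". */
struct allowlist_entry
{
    uint32_t msr;    /* MSR address */
    uint64_t wmask;  /* 1-bits mark the bits that may be written */
};

/* Parse one line such as "0x00000199 0x000000000000ffff # comment".
 * Returns true on success; text after the mask is ignored. */
bool parse_allowlist_entry(const char *line, struct allowlist_entry *out)
{
    char *end;

    errno = 0;
    unsigned long long addr = strtoull(line, &end, 0);
    if (end == line || errno == ERANGE || addr > UINT32_MAX)
        return false;   /* no number found, or address out of range */

    const char *mask_start = end;
    unsigned long long mask = strtoull(mask_start, &end, 0);
    if (end == mask_start || errno == ERANGE)
        return false;   /* write mask missing or too large */

    out->msr = (uint32_t)addr;
    out->wmask = (uint64_t)mask;
    return true;
}

/* A listed MSR is readable; it is writable only if some mask bit is set. */
bool entry_is_writable(const struct allowlist_entry *e)
{
    return e->wmask != 0;
}
```

With the entries above, the 0x10 line parses as read-only and the 0x199 line as writable in its bottom sixteen bits. A header line beginning with # is rejected because it contains no leading number.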

It is up to the system administrator to create appropriate per-architecture, per-user allowlists. The "safety" of a particular MSR depends on the totality of the environment. The msr-safe repo provides sample allowlists that have been useful in other installations; they may or may not be appropriate for yours.

To see the existing allowlist:

cat /dev/cpu/msr_allowlist

The output will look something like:

# MSR      Write mask
0x00000010 0x0000000000000000
0x00000017 0x0000000000000000
0x000000C1 0x0000000000000000
...

Comments are not preserved.

To install a new allowlist:

cat <new_allowlist> > /dev/cpu/msr_allowlist

Writing, appending, or modifying a loaded allowlist discards the existing allowlist.

Parsing a new allowlist is done in two passes. If an error occurs during the first pass the existing allowlist is undisturbed. If an error occurs during the second pass the allowlist is reset to be empty. In practice, the most common second-phase error is the discovery of a duplicate allowlist entry. See ERRORS for details.
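
Because every write replaces the loaded list, a program installing an allowlist would normally hand the complete text to the device in one write(2) call and then inspect errno. The sketch below shows one way to do that; install_allowlist is a hypothetical helper, not part of msr-safe, and the device path is passed in rather than hard-coded.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send the whole allowlist text in a single write(2),
 * since each write to the device discards the previously loaded list.
 * Returns 0 on success or a negative errno value (for example -ENOTUNIQ
 * for duplicate entries, as listed under ERRORS). */
int install_allowlist(const char *dev_path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(dev_path, O_WRONLY);
    if (fd == -1)
        return -errno;

    ssize_t n = write(fd, text, len);
    int saved = (n == -1) ? errno : 0;   /* keep errno across close(2) */
    close(fd);

    if (n == -1)
        return -saved;
    if ((size_t)n != len)
        return -EIO;   /* treat a partial write as a failure */
    return 0;
}
```

A caller would typically invoke install_allowlist("/dev/cpu/msr_allowlist", text) and, on a second-pass error such as -ENOTUNIQ, remember that the loaded allowlist is now empty.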

/dev/cpu/<cpuid>/msr_safe

Per logical-cpu interface for model-specific registers. Supports llseek(2), read(2), write(2), and open(2). Reads or writes a single MSR at a time. To access multiple MSRs and/or MSRs across multiple logical CPUs, use /dev/cpu/msr_batch.

The most common approach is to use pread(2) and pwrite(2), as these combine the seek operation with reading and writing. Alternatively, the device supports the SEEK_SET and SEEK_CUR parameters to llseek(2), but not SEEK_END. Both reads and writes must be exactly 8 bytes.
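
The pread(2)/pwrite(2) pattern can be wrapped in a few lines. This is a minimal sketch rather than an official API: open_msr_safe, read_msr, and write_msr are hypothetical names, and the file offset is assumed to be the MSR address, as in the example code at the end of this document.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Open the per-CPU device for logical CPU `cpu`; returns the fd or -1. */
int open_msr_safe(unsigned cpu)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%u/msr_safe", cpu);
    return open(path, O_RDWR);
}

/* Read one MSR: the offset is the MSR address and the transfer is 8 bytes.
 * Returns 0 on success, -1 on failure or short transfer. */
int read_msr(int fd, uint32_t msr, uint64_t *value)
{
    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    return (n == (ssize_t)sizeof *value) ? 0 : -1;
}

/* Write one MSR; the allowlist write mask decides which bits may change.
 * Returns 0 on success, -1 on failure or short transfer. */
int write_msr(int fd, uint32_t msr, uint64_t value)
{
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)msr);
    return (n == (ssize_t)sizeof value) ? 0 : -1;
}
```

On failure of the underlying system call, errno holds one of the values listed under ERRORS, for example EACCES when the MSR is not in the allowlist.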

/dev/cpu/msr_batch

Batch interface for MSR access. Only supports ioctl(2), with the first parameter being the file descriptor, the second parameter being X86_IOC_MSR_BATCH (defined in msr_safe.h), and the third parameter being a pointer to a struct msr_batch_array.

struct msr_batch_array
{
    __u32 numops;             // In: # of operations in operations array
    __u32 version;            // In: MSR_SAFE_VERSION_u32 (see msr_version.h)
    struct msr_batch_op *ops; // In: Array[numops] of operations
};

The maximum numops is system-dependent, but 30k operations is not unheard-of.

Starting in version 2.0.0, the version field will be compared to the version of the loaded kernel module with a mismatch resulting in an error. Earlier versions do not check this field.

Each op is contained in a struct msr_batch_op:

struct msr_batch_op
{
    __u16 cpu;     // In: CPU to execute {rd/wr}msr instruction
    __u16 isrdmsr; // In: 0=wrmsr, non-zero=rdmsr
    __s32 err;     // Out: set if error occurred with this operation
    __u32 msr;     // In: MSR address
    __u64 msrdata; // In/Out: Data to write or data that was read
    __u64 wmask;   // Out: Write mask applied to wrmsr
};

The cpu field uses the same numbering found in /dev/cpu/<cpuid>. A zero value for isrdmsr indicates a write operation; any other value indicates a read operation. err is populated by the kernel if there is an error on a particular operation, and will be one of ENXIO (the virtual CPU does not exist or is offline), EACCES (the requested MSR was not found in the allowlist), or EROFS (a write operation was attempted on an MSR with a write mask of 0).

msr is the address of the model-specific register. msrdata holds the value to be written to the MSR (for a write) or the value that was read from it (for a read). Finally, wmask records the write mask for the MSR provided in the allowlist.
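
To make the field descriptions concrete, here is a sketch that fills a batch reading one MSR on each of the first ncpus logical CPUs. The two structs are local mirrors of the definitions above, written with <stdint.h> types so the sketch compiles on its own; real code should include msr_safe.h instead. build_read_batch is a hypothetical helper, and the version value is a parameter because MSR_SAFE_VERSION_u32 lives in msr_version.h.

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of the structs documented above (assumption: same field
 * order and widths as msr_safe.h, with stdint types standing in for __uN). */
struct msr_batch_op
{
    uint16_t cpu;
    uint16_t isrdmsr;
    int32_t  err;
    uint32_t msr;
    uint64_t msrdata;
    uint64_t wmask;
};

struct msr_batch_array
{
    uint32_t numops;
    uint32_t version;
    struct msr_batch_op *ops;
};

/* Fill `ops` so that op i reads `msr` on logical CPU i, then describe the
 * array in `batch`. */
void build_read_batch(struct msr_batch_array *batch, struct msr_batch_op *ops,
                      uint16_t ncpus, uint32_t msr, uint32_t version)
{
    memset(ops, 0, sizeof *ops * ncpus);   /* clears err, msrdata, wmask */
    for (uint16_t i = 0; i < ncpus; i++)
    {
        ops[i].cpu = i;
        ops[i].isrdmsr = 1;                /* non-zero selects rdmsr */
        ops[i].msr = msr;
    }
    batch->numops = ncpus;
    batch->version = version;
    batch->ops = ops;
}
```

The filled struct msr_batch_array is what gets passed as the third argument to ioctl(2) with X86_IOC_MSR_BATCH; after the call, each ops[i].msrdata holds the value read and ops[i].err any per-operation error.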

/dev/cpu/msr_safe_version

Starting with version 1.6, this device contains the loaded version of msr-safe.

RETURN VALUES

On success, calls to write(2) and read(2) return the number of bytes written or read, which in the case of /dev/cpu/<cpuid>/msr_safe will be 8 (as only a single register per call may be written to or read from). llseek(2) returns the new file offset. open(2) returns the new file descriptor. ioctl(2) returns 0.

On error, all of these system calls return -1 and set errno to the appropriate value. The errors listed below are specific to msr_safe. The man pages for the individual system calls describe additional errors that may occur.

ERRORS

/dev/cpu/msr_allowlist

write(2)

E2BIG <count> exceeds MAX_WLIST_BSIZE (defined as (128 * 1024) + 1)

EILSEQ Unexpected EOF.

EINVAL Address or writemask caused parsing error.

EFAULT Kernel copy_from_user() failed.

ENOMEM Kernel unable to allocate memory to hold the raw or parsed allowlist.

ENOMSG No valid allowlist entries found.

ENOTUNIQ Duplicate allowlist entries found.

ERANGE Address or writemask is too large for an unsigned long long.

read(2)

E2BIG The read(2) <count> parameter was less than 60 bytes.

EFAULT Kernel copy_to_user() failed.

llseek(2)

EINVAL The <whence> parameter was neither SEEK_CUR nor SEEK_SET, e.g., SEEK_END.

/dev/cpu/<cpuid>/msr_safe

read(2)

EACCES The MSR requested is not in the allowlist.

EBUSY Requested virtual CPU is (temporarily?) locked.

EFAULT Kernel copy_to_user() failed.

EIO A general protection fault occurred. See the description for EIO errors in the /dev/cpu/msr_batch section below.

EINVAL Number of bytes requested to read is something other than 8.

ENXIO Requested virtual CPU does not exist or is offline.

write(2)

EACCES The MSR requested is not in the allowlist.

EBUSY Requested virtual CPU is (temporarily?) locked.

EFAULT Kernel copy_from_user() failed.

EIO A general protection fault occurred. See the description for EIO errors in the /dev/cpu/msr_batch section below.

EINVAL Number of bytes requested to write is something other than 8.

ENXIO Requested virtual CPU does not exist or is offline.

open(2)

EIO Model-specific registers not supported on this virtual CPU.

ENXIO Requested virtual CPU does not exist or is offline.

/dev/cpu/msr_batch

ioctl(2)

All of the operations in the batch will be executed. Each operation may result in an EIO, ENXIO, EACCES, or EROFS error, which will be recorded in the msr_batch_op struct. If any operation caused an error, the first such error becomes the return value for ioctl(2).
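
Since every operation runs even when an earlier one fails, callers usually scan the ops array after ioctl(2) returns. The sketch below shows one way to do that. batch_op_error_text and first_failed_op are hypothetical helpers, the struct is a local mirror of the one documented under DESCRIPTION, and because the sign convention of the err field is not stated here, the code accepts both positive and negative errno values.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct msr_batch_op (real code should use msr_safe.h). */
struct msr_batch_op
{
    uint16_t cpu;
    uint16_t isrdmsr;
    int32_t  err;
    uint32_t msr;
    uint64_t msrdata;
    uint64_t wmask;
};

/* Map a per-operation error to the meaning given in this document.
 * Assumption: err may be stored as errno or -errno, so the sign is dropped. */
const char *batch_op_error_text(int32_t err)
{
    int32_t code = (err < 0) ? -err : err;
    switch (code)
    {
    case 0:      return "no error";
    case EIO:    return "general protection fault";
    case ENXIO:  return "virtual CPU does not exist or is offline";
    case EACCES: return "MSR is not in the allowlist";
    case EROFS:  return "write attempted on a read-only MSR";
    default:     return "unexpected error code";
    }
}

/* Return the index of the first op whose err is set, or -1 if none failed. */
long first_failed_op(const struct msr_batch_op *ops, size_t numops)
{
    for (size_t i = 0; i < numops; i++)
        if (ops[i].err != 0)
            return (long)i;
    return -1;
}
```

Checking each op individually matters because ioctl(2) reports only the first error, while later operations in the same batch may have failed for different reasons.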

E2BIG Kernel unable to allocate memory to hold the array of operations.

EACCES An individual operation requested an MSR that is not present in the allowlist.

EBADF The msr_batch file was not opened for reading.

EFAULT Kernel copy_from_user() or copy_to_user() failed.

EINVAL Number of requested batch operations is <=0.

EIO A general protection fault occurred. On Intel processors this can be caused by a) attempting to access an MSR outside of ring 0, b) attempting to access a non-existent or reserved MSR address, c) writing 1-bits to a reserved area of an MSR, d) writing a non-canonical address to MSRs that take memory addresses, or e) writing to MSR bits that are marked as read-only.

ENOMEM Kernel unable to allocate memory to hold the results of zalloc_cpumask_var().

ENOTTY Invalid ioctl command. As of this writing the only ioctl command supported on this device is X86_IOC_MSR_BATCH, defined in msr_safe.h.

ENXIO An individual operation requested a virtual CPU that does not exist or is offline.

EROFS An individual operation requested a write to a read-only MSR.

open(2)

There are no msr_safe-specific error conditions.

ALLOWLIST CONTRIBUTIONS

The contribs/ folder contains allowlists shared by our external collaborators, organized by site-name.

msr-save

The msrsave utility provides a mechanism for saving and restoring MSR values based on entries in the allowlist. To restore MSR values, the register must have an appropriate writemask.

Modification of MSRs that are marked as safe in the allowlist may impact subsequent users on a shared HPC system. It is important the resource manager on such a system use the msrsave utility to save and restore MSR values between allocating compute nodes to users. An example of this has been implemented for the SLURM resource manager as a SPANK plugin. This plugin can be built with the "make spank" target and installed with the "make install-spank" target. This uses the SLURM SPANK infrastructure to make a popen(3) call to the msrsave command line utility in the job epilogue and prologue.

The version of msrsave (and msr-safe) can be modified by updating the following compiler flag:

-DVERSION=\"MAJOR.MINOR.PATCH\"

The msrsave version can be queried with:

msrsave --version

Security

Model-specific registers

The safety of a particular model-specific register depends on the system environment. The sample allowlists provided were developed for non-classified high performance computing systems where only a single non-privileged user at a time can access a given compute node. These lists should be re-evaluated for use in other environments, particularly multi-user environments.

Filesystem permissions

msr-safe is designed to support multiple classes of users, each of which would have their own group and allowlist. Best practice is to unload and reload the msr-safe kernel module when changing device ownership or permissions. If this is not done, a lower-privileged user can open /dev/cpu/msr_batch and retain the file descriptor until the permissions (and allowlist) are changed to allow higher-privileged users to run. If the allowlist remains readable by the lower-privileged user, that user can continue using the original file descriptor with the higher-privileged allowlist.

FAQ

Can I append or modify an allowlist in place?

No. Each write(2) call discards the previous allowlist.

What happens if an allowlist is changed during an ioctl(2) call?

The kernel records all of the relevant writemasks in the struct msr_batch_op prior to executing the ops. If the allowlist is changed during a call, the new allowlist will be applied to subsequent calls.

How many operations can fit into one batch?

Determining the formula to provide an upper bound is almost certainly more trouble than it's worth, but we have easily gotten 30k entries in a single batch on production machines.

What happens if a CPU is taken offline or brought back online?

We haven't had a good reason to wire up hotplugging. If the collection of online CPUs changes, it's best to unload and reload the msr-safe kernel module.

What happens if a CPU is taken offline and a user still has an open file descriptor for that device?

The kernel checks to see if a CPU is online. Attempts to access MSRs using that file descriptor should generate an error.

Can the batch API be extended to do other operations such as polling?

It can and it has. If you need this functionality please let us know. The code is brittle enough that we don't use it in production, but we are happy to share.

EXAMPLE CODE

/* This example assumes the user has the following permissions:
 *
 * write        /dev/cpu/msr_allowlist
 * read/write   /dev/cpu/<cpu_number>/msr_safe
 * read         /dev/cpu/msr_batch
 *
 * Typically, only the administrator will have write permissions
 * on the allowlist.
 *
 * Production code should have more robust error handling than
 * what is shown here.
 *
 * This example should be able to run successfully on an x86
 * processor from the past ten years or so.
 *
 */


#include <stdio.h>      // printf(3), snprintf(3)
#include <assert.h>     // assert(3)
#include <fcntl.h>      // open(2)
#include <unistd.h>     // write(2), pwrite(2), pread(2), close(2)
#include <string.h>     // strlen(3), memset(3)
#include <stdint.h>     // uint8_t, uint64_t
#include <inttypes.h>   // PRIu8, PRIu64
#include <stdlib.h>     // exit(3)
#include <sys/ioctl.h>  // ioctl(2)

#include "../msr_safe.h"    // batch data structs
#include "../msr_version.h" // MSR_SAFE_VERSION_u32 (path assumed to match msr_safe.h)

#define MSR_MPERF 0xE7

char const *const allowlist = "0xE7 0xFFFFFFFFFFFFFFFF\n";  // MPERF

static uint8_t const nCPUs = 32;

void set_allowlist(void)
{
    int fd = open("/dev/cpu/msr_allowlist", O_WRONLY);
    assert(-1 != fd);
    // The whole allowlist goes in one write(2); each write replaces the old list
    ssize_t nbytes = write(fd, allowlist, strlen(allowlist));
    assert((ssize_t)strlen(allowlist) == nbytes);
    close(fd);
}

void measure_serial_latency(void)
{
    int fd[nCPUs], rc;
    char filename[255];
    uint64_t data[nCPUs];
    memset(data, 0, sizeof(uint64_t)*nCPUs);

    // Open each of the msr_safe devices (one per CPU)
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        rc = snprintf(filename, sizeof(filename), "/dev/cpu/%"PRIu8"/msr_safe", i);
        assert(rc > 0);
        fd[i] = open(filename, O_RDWR);
        assert(-1 != fd[i]);
    }
    // Write 0 to each MPERF register
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        rc = pwrite(fd[i], &data[i], sizeof(uint64_t), MSR_MPERF);
        assert(8 == rc);
    }

    // Read each MPERF register
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        rc = pread(fd[i], &data[i], sizeof(uint64_t), MSR_MPERF);
        assert(8 == rc);
    }

    // Show results
    printf("Serial cycles from first write to last read: "
           "%"PRIu64" (on %"PRIu8" CPUs)\n",
           data[nCPUs - 1], nCPUs);

    // Release the per-CPU file descriptors
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        close(fd[i]);
    }
}

void measure_batch_latency(void)
{
    struct msr_batch_array rbatch, wbatch;
    struct msr_batch_op r_ops[nCPUs], w_ops[nCPUs];
    int fd, rc;

    fd = open("/dev/cpu/msr_batch", O_RDONLY);
    assert(-1 != fd);

    // Zero the ops so err, msrdata, and wmask start in a known state
    memset(r_ops, 0, sizeof(r_ops));
    memset(w_ops, 0, sizeof(w_ops));
    for (uint8_t i = 0; i < nCPUs; i++)
    {
        r_ops[i].cpu = w_ops[i].cpu = i;
        r_ops[i].isrdmsr = 1;
        w_ops[i].isrdmsr = 0;
        r_ops[i].msr = w_ops[i].msr = MSR_MPERF;
        w_ops[i].msrdata = 0;
    }
    rbatch.numops = wbatch.numops = nCPUs;
    // Version 2.0.0 and later reject a batch whose version does not match the module
    rbatch.version = wbatch.version = MSR_SAFE_VERSION_u32;
    rbatch.ops = r_ops;
    wbatch.ops = w_ops;

    rc = ioctl(fd, X86_IOC_MSR_BATCH, &wbatch);
    assert(-1 != rc);
    rc = ioctl(fd, X86_IOC_MSR_BATCH, &rbatch);
    assert(-1 != rc);

    printf("Batch cycles from first write to last read: "
           "%"PRIu64" (on %"PRIu8" CPUs)\n",
           (uint64_t)r_ops[nCPUs - 1].msrdata, nCPUs);

    close(fd);
}

int main()
{
    set_allowlist();
    measure_serial_latency();
    measure_batch_latency();
    return 0;
}

Release

msr-safe is released under the GPL v2.0 license. For more details, please see the LICENSE and NOTICE files.

SPDX-License-Identifier: GPL-2.0-only

LLNL-CODE-807679

License and LLNL release number have been corrected to match internal records.