Skip to content

Commit

Permalink
chore(bors): merge pull request #892
Browse files Browse the repository at this point in the history
892: fix(csi-node): handle devices for existing subsystems r=tiagolobocastro a=tiagolobocastro

When a device is broken but the volume is not unstaged properly, the subsytem ends up lingering in the system.
If the device is staged again, the device is assined a different namespace id. Change the regex which is used in the device parameter fo account for this. Also extend the existing csi-node BDD testing.

Co-authored-by: Tiago Castro <tiagolobocastro@gmail.com>
  • Loading branch information
mayastor-bors and tiagolobocastro committed Dec 2, 2024
2 parents 00190e9 + 18be1cf commit 8192704
Show file tree
Hide file tree
Showing 7 changed files with 182 additions and 30 deletions.
39 changes: 21 additions & 18 deletions control-plane/csi-driver/src/bin/node/dev/nvmf.rs
Original file line number Diff line number Diff line change
@@ -1,15 +1,14 @@
use nvmeadm::{
error::NvmeError,
nvmf_discovery::{disconnect, ConnectArgsBuilder, TrType},
};
use std::{
collections::HashMap,
convert::{From, TryFrom},
path::Path,
str::FromStr,
};

use nvmeadm::{
error::NvmeError,
nvmf_discovery::{disconnect, ConnectArgsBuilder, TrType},
};

use csi_driver::PublishParams;
use glob::glob;
use nvmeadm::nvmf_subsystem::Subsystem;
Expand All @@ -29,7 +28,7 @@ use crate::{
use super::{Attach, Detach, DeviceError, DeviceName};

lazy_static::lazy_static! {
static ref DEVICE_REGEX: Regex = Regex::new(r"nvme(\d{1,3})n1").unwrap();
static ref DEVICE_REGEX: Regex = Regex::new(r"nvme(\d{1,5})n(\d{1,5})").unwrap();
}

pub(super) struct NvmfAttach {
Expand Down Expand Up @@ -153,6 +152,12 @@ impl Attach for NvmfAttach {
if let Some(keep_alive_tmo) = nvme_config.keep_alive_tmo() {
self.keep_alive_tmo = Some(keep_alive_tmo);
}
if self.io_tmo.is_none() {
if let Some(io_tmo) = publish_context.io_timeout() {
self.io_tmo = Some(*io_tmo);
}
}

Ok(())
}

Expand Down Expand Up @@ -221,18 +226,16 @@ impl Attach for NvmfAttach {
.get_device()?
.ok_or_else(|| DeviceError::new("NVMe device not found"))?;
let dev_name = device.sysname().to_str().unwrap();
let major = DEVICE_REGEX
.captures(dev_name)
.ok_or_else(|| {
DeviceError::new(&format!(
"NVMe device \"{}\" does not match \"{}\"",
dev_name, *DEVICE_REGEX,
))
})?
.get(1)
.unwrap()
.as_str();
let pattern = format!("/sys/class/block/nvme{major}c*n1/queue");
let captures = DEVICE_REGEX.captures(dev_name).ok_or_else(|| {
DeviceError::new(&format!(
"NVMe device \"{}\" does not match \"{}\"",
dev_name, *DEVICE_REGEX,
))
})?;
let major = captures.get(1).unwrap().as_str();
let nid = captures.get(2).unwrap().as_str();

let pattern = format!("/sys/class/block/nvme{major}c*n{nid}/queue");
let glob = glob(&pattern).unwrap();
let result = glob
.into_iter()
Expand Down
20 changes: 18 additions & 2 deletions tests/bdd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,61 +2,77 @@

The BDD tests are written in Python and make use of the pytest-bdd library.

The feature files in the `features` directory define the behaviour expected of the control plane. These behaviours are described using the [Gherkin](https://cucumber.io/docs/gherkin/) syntax.
The feature files in the `features` directory define the behaviour expected of the control plane. These behaviours are
described using the [Gherkin](https://cucumber.io/docs/gherkin/) syntax.

The feature files can be used to auto-generate the test file. For example running `pytest-bdd generate replicas.feature > test_volume_replicas.py`
The feature files can be used to auto-generate the test file. For example
running `pytest-bdd generate replicas.feature > test_volume_replicas.py`
generates the `test_volume_replicas.py` test file from the `replicas.feature` file.
When updating the feature file, you can also get some helpe updating the python code.
Example: `pytest --generate-missing --feature node.feature test_node.py`

**:warning: Note: Running pytest-bdd generate will overwrite any existing files with the same name**

## Running the Tests by entering the python virtual environment

Before running any tests run the `setup.sh` script. This sets up the necessary environment to run the tests:

```bash
# NOTE: you should be inside the nix-shell to begin
source ./setup.sh
```

To run all the tests:

```bash
pytest .
```

To run individual test files:

```bash
pytest features/volume/create/test_feature.py
```

To run an individual test within a test file use the `-k` option followed by the test name:

```bash
pytest features/volume/create/test_feature.py -k test_sufficient_suitable_pools
```

## Running the Tests

The script in `../../scripts/python/test.sh` can be used to run the tests without entering the venv.
This script will implicitly enter and exit the venv during test execution.

To run all the tests:

```bash
../../scripts/python/test.sh
```

Arguments will be passed directly to pytest. Example running individual tests:

```bash
../../scripts/python/test.sh features/volume/create/test_feature.py -k test_sufficient_suitable_pools
```

## Debugging the Tests

Typically, the test code cleans up after itself and so it's impossible to debug the test cluster.
The environmental variable `CLEAN` can be set to `false` to skip tearing down the cluster when a test ends.
It should be paired with the pytest option `-x` or `--exitfirst` to exit when the first test fails otherwise the other
tests may end up tearing down the cluster anyway.

Example:

```bash
CLEAN=false ../../scripts/python/test.sh features/volume/create/test_feature.py -k test_sufficient_suitable_pools -x
```

The script `../../scripts/python/test.sh` does a lot of repetitive work of regenerating the auto-generated code.
This step can be skipped with the `FAST` environment variable to speed up the test cycle:

```bash
FAST=true ../../scripts/python/test.sh features/volume/create/test_feature.py -k test_sufficient_suitable_pools -x
```
5 changes: 5 additions & 0 deletions tests/bdd/common/deployer.py
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,11 @@ def node_name(id: int):
assert id >= 0
return f"io-engine-{id + 1}"

@staticmethod
def csi_node_name(id: int):
assert id >= 0
return f"csi-node-{id + 1}"

@staticmethod
def create_disks(len=1, size=100 * 1024 * 1024):
host_tmp = workspace_tmp()
Expand Down
6 changes: 6 additions & 0 deletions tests/bdd/common/docker.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,12 @@ def unpause_container(name):
container = docker_client.containers.get(name)
container.unpause()

@staticmethod
def execute(name, commands):
docker_client = docker.from_env()
container = docker_client.containers.get(name)
return container.exec_run(commands)

# Restart a container with the given name.
def restart_container(name):
docker_client = docker.from_env()
Expand Down
22 changes: 18 additions & 4 deletions tests/bdd/common/nvme.py
Original file line number Diff line number Diff line change
Expand Up @@ -307,21 +307,35 @@ def identify_namespace(device):

def nvme_set_reconnect_delay_all(nqn, delay=10):
for controller in nvme_find_controllers(nqn):
nvme_controller_set_reconnect_delay(controller, delay)
nvme_controller_set_param(controller, "reconnect_delay", delay)


def nvme_set_reconnect_delay(uri, delay=10):
controller = nvme_find_controller(uri)
nvme_controller_set_reconnect_delay(controller, delay)
nvme_controller_set_param(controller, "reconnect_delay", delay)


def nvme_controller_set_reconnect_delay(controller, delay=10):
def nvme_set_ctrl_loss_tmo(uri, ctrl_loss_tmo=10):
controller = nvme_find_controller(uri)
nvme_controller_set_param(controller, "ctrl_loss_tmo", ctrl_loss_tmo)


def nvme_controller_set_param(controller, name: str, value: int):
controller = controller.get("Controller")
command = f"echo {delay} | sudo tee -a /sys/class/nvme/{controller}/reconnect_delay"
command = f"echo {value} | sudo tee -a /sys/class/nvme/{controller}/{name}"
print(command)
subprocess.run(command, check=True, shell=True, capture_output=True)


@retry(wait_fixed=100, stop_max_attempt_number=40)
def wait_nvme_find_device(uri):
return nvme_find_device(uri)


@retry(wait_fixed=100, stop_max_attempt_number=40)
def wait_nvme_gone_device(uri, nqn=None):
if nqn is None:
u = urlparse(uri)
nqn = u.path[1:]
devs = nvme_find_subsystem_devices(nqn)
assert len(devs) == 0
13 changes: 13 additions & 0 deletions tests/bdd/features/csi/node/node.feature
Original file line number Diff line number Diff line change
Expand Up @@ -106,3 +106,16 @@ Feature: CSI node plugin
Scenario: publishing a reader only block volume as rw
Given a block volume staged as "MULTI_NODE_READER_ONLY"
When publishing the block volume as "rw" should fail

Scenario: re-staging after controller loss timeout
Given a staged volume
When the kernel device is removed after a controller loss timeout
Then the volume should be unstageable
And the volume should be stageable again

Scenario: re-staging after controller loss timeout without unstaging
Given a staged volume
When the kernel device is removed after a controller loss timeout
Then the mounts become broken
But the volume should be stageable on a different path
And the nvme device should have a different controller and namespace
Loading

0 comments on commit 8192704

Please sign in to comment.