Fix nvmf module and clean up some docs #48

Merged 3 commits on Oct 17, 2024
1 change: 1 addition & 0 deletions Dockerfile
@@ -16,6 +16,7 @@ COPY meson.* /app/
COPY src /app/src
COPY subprojects /app/subprojects
COPY test /app/test
COPY tools /app/tools

RUN make install-deps \
&& make release
2 changes: 2 additions & 0 deletions Makefile
@@ -33,3 +33,5 @@ install-deps:
# LSVD deps
sudo apt install -y meson mold libfmt-dev librados-dev \
libjemalloc-dev libradospp-dev pkg-config uuid-dev ceph-common
# to make my life a little easier
sudo apt install -y gdb fish
240 changes: 84 additions & 156 deletions README.md
@@ -13,14 +13,6 @@ Note that although individual disk performance is important, the main goal is to
be able to support higher aggregate client IOPS against a given backend OSD
pool.

## what's here

This builds `liblsvd.so`, which provides most of the basic RBD API; you can use
`LD_PRELOAD` to use this in place of RBD with `fio`, KVM/QEMU, and a few other
tools. It also includes some tests and tools described below.

The repository also includes scripts to setup a SPDK NVMeoF target.

## Stability

This is NOT production-ready code; it still occasionally crashes, and some
@@ -30,66 +22,105 @@ It is able to install and boot Ubuntu 22.04 (see `qemu/`) and is stable under
most of our tests, but there are likely regressions around crash recovery and
other less well-trodden paths.

## Build
## How to run

This project uses `meson` to manage the build system. Run `make setup` to
generate the build files, then run `meson compile` in either `build-rel` or
`build-dbg` to build the release or debug versions of the code.
Note that the examples below assume the fish shell, a local NVMe cache device at
`/dev/nvme0n1`, and Ceph config files available in `/etc/ceph`.

A makefile is also offered for convenience; `make` builds the debug version
by default.
```
sudo sh -c 'echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages'
sudo docker run --net host -v /dev/hugepages:/dev/hugepages -v /etc/ceph:/etc/ceph -v /var/tmp:/var/tmp -v /dev/shm:/dev/shm -v /mnt/nvme0:/lsvd -i -t --privileged --entrypoint /usr/bin/fish ghcr.io/cci-moc/lsvd-rbd:main
```

## Configuration
If you run into an error, you might need to rebuild the image:

LSVD is not yet merged into the Ceph configuration framework, and uses its own
system. It reads from a configuration file (`lsvd.conf` or
`/usr/local/etc/lsvd.conf`) or from environment variables of the form
`LSVD_<NAME>`, where NAME is the upper-case version of the config file variable.
Default values can be found in `config.h`

Parameters are:

- `batch_size`, `LSVD_BATCH_SIZE`: size of objects written to the backend, in bytes (K/M recognized as 1024, 1024\*1024). Default: 8MiB
- `wcache_batch`: write cache batching (see below)
- `wcache_chunk`: maximum size of atomic write, in bytes - larger writes will be split and may be non-atomic.
- `rcache_dir` - directory used for read cache file and GC temporary files. Note that `imgtool` can format a partition for cache and symlink it into this directory, although the performance improvement seems limited.
- `wcache_dir` - directory used for write cache file
- `xlate_window`: max writes (i.e. objects) in flight to the backend. Note that this value is coupled to the size of the write cache, which must be big enough to hold all outstanding writes in case of a crash.
- `hard_sync` (untested): "flush" forces all batched writes to the backend.
- `backend`: "file" or "rados" (default rados). The "file" backend is for testing only
- `cache_size` (bytes, K/M/G): total size of the cache file. Currently split 1/3 write, 2/3 read. Ignored if the cache file already exists.
- `ckpt_interval` N: limits the number of objects to be examined during crash recovery by flushing metadata every N objects.
- `flush_msec`: timeout for flushing batched writes
- `gc_threshold` (percent): described below

Typically the only parameters that need to be set are `cache_dir` and
`cache_size`. Parameters may be added or removed as we tune things and/or
figure out how to optimize at runtime instead of bothering the user for a value.
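
As a minimal sketch, the same options can be set through environment variables of
the `LSVD_<NAME>` form described above (the paths and size below are illustrative,
not defaults):

```
# illustrative values - adjust the paths and size to your setup
export LSVD_CACHE_SIZE=10G
export LSVD_RCACHE_DIR=/mnt/nvme/lsvd-cache
export LSVD_WCACHE_DIR=/mnt/nvme/lsvd-cache
```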

## Using LSVD with fio and QEMU

First create a volume:
```
build$ sudo imgtool create poolname imgname --size=20g
git clone https://github.com/cci-moc/lsvd-rbd.git
cd lsvd-rbd
docker build -t lsvd-rbd .
sudo docker run --net host -v /dev/hugepages:/dev/hugepages -v /etc/ceph:/etc/ceph -v /var/tmp:/var/tmp -v /dev/shm:/dev/shm -v /mnt/nvme0:/lsvd -i -t --privileged --entrypoint /usr/bin/fish lsvd-rbd
```

Then you can start a SPDK NVMe-oF gateway:
To start the gateway:

```
./qemu/qemu-gateway.sh pool imgname
./build-rel/lsvd_tgt
```

Then connect to the NVMe-oF gateway:
The target will listen for RPC commands on `/var/tmp/spdk.sock`.
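
As a quick sanity check (assuming the stock `rpc.py` bundled with the SPDK
subproject), you can query the running target over that socket:

```
cd subprojects/spdk/scripts
./rpc.py -s /var/tmp/spdk.sock spdk_get_version
```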

To create an lsvd image on the backend:

```
nvme connect -t tcp -n nqn.2016-06.io.spdk:cnode1 -a
#./imgtool create <pool> <imgname> --size 100g
./imgtool create lsvd-ssd benchtest1 --size 100g
```

You should now have a plain old NVMe device, which you can use just like any
other NVMe device.
To configure nvmf:

```
cd subprojects/spdk/scripts
./rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
./rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1
./rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 0.0.0.0 -s 9922
```
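
To double-check that the transport, subsystem, and listener were created (a
standard SPDK RPC, shown here only as a sanity check):

```
./rpc.py nvmf_get_subsystems
```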

To mount images on the gateway:

```
export PYTHONPATH=/app/src/
./rpc.py --plugin rpc_plugin bdev_lsvd_create lsvd-ssd benchtest1 -c '{"rcache_dir":"/lsvd","wlog_dir":"/lsvd"}'
./rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 benchtest1
```
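
To confirm the bdev was created and attached as a namespace (again a standard
SPDK RPC, used only for verification):

```
./rpc.py bdev_get_bdevs
```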

To gracefully shut down the gateway:

```
./rpc.py --plugin rpc_plugin bdev_lsvd_delete benchtest1
./rpc.py spdk_kill_instance SIGTERM
docker kill <container id>
```

## Mount a client

Fill in the appropriate IP address:

```
modprobe nvme-fabrics
nvme disconnect -n nqn.2016-06.io.spdk:cnode1
export gw_ip=${gw_ip:-192.168.52.109}
nvme connect -t tcp --traddr $gw_ip -s 9922 -n nqn.2016-06.io.spdk:cnode1 -o normal
sleep 2
nvme list
dev_name=$(nvme list | perl -lane 'print @F[0] if /SPDK/')
printf "Using device $dev_name\n"
```
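
For a quick smoke test from the client (assuming `fio` is installed; this is a
read-only job so it won't modify the image):

```
sudo fio --name=smoke-test --filename=$dev_name --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=1 --time_based --runtime=30
```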

## Build

This project uses `meson` to manage the build system. Run `make setup` to
generate the build files, then run `meson compile` in either `build-rel` or
`build-dbg` to build the release or debug versions of the code.

A makefile is also offered for convenience; `make` builds the debug version
by default.

## Configuration

LSVD is configured using a JSON file. When creating an image, we will
try to read the following paths and parse them for configuration options:

Do not use multiple fio jobs on the same image - currently there's no protection
and they'll stomp all over each other. RBD performs horribly in that case, but
AFAIK it doesn't compromise correctness.
- Default built-in configuration
- `/usr/local/etc/lsvd.json`
- `./lsvd.json`
- user supplied path

The file read last has highest priority.

We will also first try to parse the user-supplied path as a JSON object, and if
that fails, treat it as a path and read the configuration from that file.

An example configuration file is provided in `docs/example_config.json`.
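
A minimal sketch of such a file, using only the two options that already appear
in the `bdev_lsvd_create` call above (see `docs/example_config.json` for the
full set):

```
{
    "rcache_dir": "/lsvd",
    "wlog_dir": "/lsvd"
}
```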

## Image and object names

@@ -172,106 +203,3 @@ Allowed options:
```

Other tools live in the `tools` subdirectory - see the README there for more details.

## Usage

### Running SPDK target

You might need to enable hugepages:
```
sudo sh -c 'echo 4096 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages'
```

Now we start the target, with or without `LD_PRELOAD`, potentially under the debugger. Run `spdk_tgt --help` for more options - in particular, the RPC socket defaults to `/var/tmp/spdk.sock` but a different one can be specified, which might allow running multiple instances of SPDK. Also, the `rpc.py` command has a `--help` option, which is about 500 lines long.

```
SPDK=/mnt/nvme/ceph-nvmeof/spdk
sudo LD_PRELOAD=$PWD/liblsvd.so $SPDK/build/bin/spdk_tgt
```
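
For example, to run a second instance against a different RPC socket (the socket
path here is illustrative):

```
sudo LD_PRELOAD=$PWD/liblsvd.so $SPDK/build/bin/spdk_tgt -r /var/tmp/spdk2.sock
sudo $SPDK/scripts/rpc.py -s /var/tmp/spdk2.sock spdk_get_version
```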

Here's a simple setup - the first two steps are handled in the ceph-nvmeof python code, and it may be worth looking through the code more to see what options they use.

```
sudo $SPDK/scripts/rpc.py nvmf_create_transport -t TCP -u 16384 -m 8 -c 8192
sudo $SPDK/scripts/rpc.py bdev_rbd_register_cluster rbd_cluster
sudo $SPDK/scripts/rpc.py bdev_rbd_create rbd rbd/fio-target 4096 -c rbd_cluster
sudo $SPDK/scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -d SPDK_Controller1
sudo $SPDK/scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 Ceph0
sudo $SPDK/scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 10.1.0.8 -s 5001
```

Note also that you can create a ramdisk test by (1) creating a ramdisk with brd, and (2) creating another bdev / namespace with `bdev_aio_create`. With the version of SPDK I have, it does 4KB random read/write at about 100K IOPS, or at least it did, a month or two ago, on the HP machines.
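
A sketch of that ramdisk variant, assuming the `brd` module and reusing the
subsystem created above (device and bdev names are illustrative):

```
sudo modprobe brd rd_nr=1 rd_size=4194304   # one 4 GiB ramdisk at /dev/ram0 (rd_size is in KiB)
sudo $SPDK/scripts/rpc.py bdev_aio_create /dev/ram0 aio0 4096
sudo $SPDK/scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 aio0
```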

Finally, I’m not totally convinced that the options I used are the best ones - the -u/-m/-c options for `create_transport` were blindly copied from a doc page. I’m a little more convinced that specifying a 4KB block size in `bdev_rbd_create` is a good idea.

## Tests

There are two tests included: `lsvd_rnd_test` and `lsvd_crash_test`.
They do random writes of various sizes, with random data, and each 512-byte sector is "stamped" with its LBA and a sequence number for the write.
CRCs are saved for each sector, and after a bunch of writes we read everything back and verify that the CRCs match.

### `lsvd_rnd_test`

```
build$ bin/lsvd_rnd_test --help
Usage: lsvd_rnd_test [OPTION...] RUNS

-c, --close close and re-open
-d, --cache-dir=DIR cache directory
-D, --delay add random backend delays
-k, --keep keep data between tests
-l, --len=N run length
-O, --rados use RADOS
-p, --prefix=PREFIX object prefix
-r, --reads=FRAC fraction reads (0.0-1.0)
-R, --reverse reverse NVMe completion order
-s, --seed=S use this seed (one run)
-v, --verbose print LBAs and CRCs
-w, --window=W write window
-x, --existing don't delete existing cache
-z, --size=S volume size (e.g. 1G, 100M)
-Z, --cache-size=N cache size (K/M/G)
-?, --help Give this help list
--usage Give a short usage message
```

Unlike the normal library, it defaults to storing objects on the filesystem; the image name is just the path to the superblock object (the --prefix argument), and other objects live in the same directory.
If you use this, you probably want to use the `--delay` flag, to have object read/write requests subject to random delays.
It creates a volume of --size bytes, does --len random writes of random lengths, and then reads it all back and checks CRCs.
It can do multiple runs; if you don't specify --keep it will delete and recreate the volume between runs.
The --close flag causes it to close and re-open the image between runs; otherwise it stays open.
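
A hypothetical invocation against the file backend (paths, size, and run count
are illustrative):

```
build$ mkdir -p /tmp/lsvd-test /tmp/lsvd-cache
build$ bin/lsvd_rnd_test --size=1G --len=10000 --delay \
    --prefix=/tmp/lsvd-test/img --cache-dir=/tmp/lsvd-cache 3
```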

### `lsvd_crash_test`

This is pretty similar, except that it does the writes in a subprocess which kills itself with `_exit` rather than finishing gracefully, and it has an option to delete the cache before restarting.

This one needs to be run with the file backend, because some of the test options crash the writer, recover the image to read and verify it, then restore it back to its crashed state before starting the writer up again.

It uses the write sequence numbers to figure out which writes made it to disk before the crash, scanning all the sectors to find the highest sequence number stamp, then it verifies that the image matches what you would get if you apply all writes up to and including that sequence number.

```
build$ bin/lsvd_crash_test --help
Usage: lsvd_crash_test [OPTION...] RUNS

-2, --seed2 seed-generating seed
-d, --cache-dir=DIR cache directory
-D, --delay add random backend delays
-k, --keep keep data between tests
-l, --len=N run length
-L, --lose-writes=N delete some of last N cache writes
-n, --no-wipe don't clear image between runs
-o, --lose-objs=N delete some of last N objects
-p, --prefix=PREFIX object prefix
-r, --reads=FRAC fraction reads (0.0-1.0)
-R, --reverse reverse NVMe completion order
-s, --seed=S use this seed (one run)
-S, --sleep child sleeps for debug attach
-v, --verbose print LBAs and CRCs
-w, --window=W write window
-W, --wipe-cache delete cache on restart
-x, --existing don't delete existing cache
-z, --size=S volume size (e.g. 1G, 100M)
-Z, --cache-size=N cache size (K/M/G)
-?, --help Give this help list
--usage Give a short usage message
```
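
And a similarly hypothetical run that wipes the cache on each restart (options
taken from the list above; paths are illustrative):

```
build$ bin/lsvd_crash_test --size=1G --len=10000 --delay --wipe-cache \
    --prefix=/tmp/lsvd-test/img --cache-dir=/tmp/lsvd-cache 3
```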
14 changes: 0 additions & 14 deletions docs/configuration.md

This file was deleted.

11 changes: 0 additions & 11 deletions docs/install.md

This file was deleted.
