Releases: RSE-Cambridge/data-acc
Fixes for large buffers.
Beyond a certain size, a buffer no longer requests an MDT for every OST; that case caused a failure, which this release fixes.
sha256:576717be2abc9b37dd1f07f30320f42c9c2ce35d1b929c446dc0348759136556
Extra config to aid testing
Added support for DAC_MAX_MDT_COUNT and DAC_MDT_SIZE_MB config.
sha256:beed4ab5ee72f68b244c4e5fd3d32584c7c5d0efcf165c7077ba7d468c2b8efb
Move from Partitions to LVM
Includes replacing the DAC_MDT_SIZE setting with DAC_MDT_SIZE_GB: drop any units from your old config value and set it to a plain integer number of GB.
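A sketch of the config migration as a small Go helper. `migrateMDTSize` is hypothetical and assumes old values were written like `20GB` or `20G`; anything in other units (for example `512MB`) is refused rather than silently converted, so you set DAC_MDT_SIZE_GB by hand in that case.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// migrateMDTSize turns an old-style DAC_MDT_SIZE value into the plain integer
// expected by DAC_MDT_SIZE_GB. It assumes the old value was already in GB and
// returns an error for anything it cannot convert without guessing.
func migrateMDTSize(old string) (int, error) {
	v := strings.ToUpper(strings.TrimSpace(old))
	v = strings.TrimSuffix(v, "GB")
	v = strings.TrimSuffix(v, "G")
	gb, err := strconv.Atoi(strings.TrimSpace(v))
	if err != nil || gb <= 0 {
		return 0, fmt.Errorf("cannot convert %q to a whole number of GB; set DAC_MDT_SIZE_GB by hand", old)
	}
	return gb, nil
}

func main() {
	for _, old := range []string{"20GB", "20g", "512MB"} {
		gb, err := migrateMDTSize(old)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("DAC_MDT_SIZE=%s  ->  DAC_MDT_SIZE_GB=%d\n", old, gb)
	}
}
```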
sha256 26671f456294a7d22839a58f3bf979617b084d604d3e931aec998d87abd136f6
Wait for umount to succeed
Removes lazy umount, so we spot umount errors when they happen. Also times out commands that hang (for example, ansible runs talking to clients with a dead mount), and reworks some of the very old watching logic to use the new channel-based code paths.
sha256 21dc9f34952309dfd47306a249d6a5ec7be21ee38f2e73836893c09b0f67c131
Unmount related fixes
Only attempts to mount and umount the attachments we have detected as changed. Also ignores any swapoff-related errors, as they are not critical and likely interact with some sites' Slurm post-amble scripts.
sha256 a42a11f18c4ec40a347c37c18024a217a1cfa77d8b04faea2df34b70ef1ae83e
Stability Fixes
Holds a lock during every volume operation, to avoid the races between the setup and teardown logic that we have been seeing.
sha256 a21aef60b907de200a1a3dfe8ed14d014aec2ef6fb47ef4cca92a42580d9a88c
Every OST has matching MDT
First attempt at giving every OST its own matching MDT by partitioning the assigned device.
NOTE: to adopt this release you need to rebuild your full dac environment, because fs-ansible is not backwards compatible with old releases.
sha256 c59fffada05f8dd13dd3797c96e84c7cf9259687284c9970d8306178dc9a347f
setup allows nodehostnamefile
It turns out setup can be passed nodehostnamefile. For now it is accepted and ignored.
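"Accepted and ignored" can be sketched with Go's standard `flag` package: define the flag so parsing does not fail with "flag provided but not defined", then never read its value. The real CLI may use a different flag library; `parseSetupArgs` and the `token` flag here are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// parseSetupArgs parses a hypothetical "setup" subcommand. nodehostnamefile
// is defined only so that callers passing it are not rejected; its value is
// deliberately discarded.
func parseSetupArgs(args []string) (string, error) {
	fs := flag.NewFlagSet("setup", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the output on errors
	token := fs.String("token", "", "buffer token (illustrative)")
	_ = fs.String("nodehostnamefile", "", "accepted and ignored for now")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *token, nil
}

func main() {
	token, err := parseSetupArgs([]string{"--token", "job1", "--nodehostnamefile", "/tmp/nodes"})
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Println("token:", token)
}
```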
sha256 0cca50ad72a4064cf7be24aa7d562bdd3b96d6da081b628782623539ea5e939a
dacd watch fixes
Reworks how dacd watches for new volumes, and how it watches volumes once it sees them appear. Adds extra debug info to help track down cases where we seem to lose events. Includes calling each instance of processNewPrimaryBlock inside a new goroutine.
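The goroutine-per-event pattern can be sketched as below. `newVolumeEvent` and `processBlock` are hypothetical stand-ins playing the role that processNewPrimaryBlock plays in dacd; its real signature is not shown in these notes. Handling each event in its own goroutine means one slow volume does not hold up the events queued behind it, and the `sync.WaitGroup` lets the watcher wait for all handlers before returning.

```go
package main

import (
	"fmt"
	"sync"
)

// newVolumeEvent is a hypothetical stand-in for a "new volume" notification.
type newVolumeEvent struct {
	Name string
}

// processBlock is a placeholder for the per-volume handling work.
func processBlock(ev newVolumeEvent, results chan<- string) {
	results <- "processed " + ev.Name
}

// watchNewVolumes starts one goroutine per event. It returns once the events
// channel is closed and every handler has finished. results must be buffered
// or drained concurrently, otherwise the handlers block.
func watchNewVolumes(events <-chan newVolumeEvent, results chan<- string) {
	var wg sync.WaitGroup
	for ev := range events {
		wg.Add(1)
		go func(ev newVolumeEvent) {
			defer wg.Done()
			processBlock(ev, results)
		}(ev)
	}
	wg.Wait()
}

func main() {
	events := make(chan newVolumeEvent)
	results := make(chan string, 3)
	go func() {
		for _, name := range []string{"vol-a", "vol-b", "vol-c"} {
			events <- newVolumeEvent{Name: name}
		}
		close(events)
	}()
	watchNewVolumes(events, results)
	close(results)
	for r := range results {
		fmt.Println(r) // order varies between runs
	}
}
```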
sha256 ba7b956cbef2b5143294e27b93dc521a784e156778780ceccfcdb880b73fe0f6
Very experimental stage_in and stage_out support
Tries using rsync, run as the buffer user, to stage data in and out. This needs lots of hardening, and only tries to support $DW_JOB_STRIPED mode buffers.
To work around any stage_in or stage_out related failures, remove the #DW stage_in and #DW stage_out lines from your submission scripts.
sha256 ecc6fef8eafcb13208817be810b4b09ab12f44a3d755a52ad608fb898bfb249b