Commit
So far, after a full reboot of all machines, some would sometimes have failed systemd units.

Key changes:

* A mount-only machine is added to test that this use case works.
  This made me find all the troubles below.
* Fix SSH hang by using a .mount unit instead of the fstab converter.
  This apparently works around NixOS/nixpkgs#30348 for me.
  No idea why the fstab converter would have this problem.
  The nasty `pam_systemd(sshd:session): Failed to create session: Connection timed out` error would slow down SSH logins by 25 seconds, also making reboots slower (because nixops keys upload uses SSH).
  It would also show things like `session-1.scope` as failed in systemctl.
* More robustly track (via Consul) whether the Gluster volume is already mountable from the client (that is, up and running on the servers).
  This has come a long way; to implement this, I have so far tried
  * manual sessions, but those have a 10 second minimum TTL which gets auto-extended even longer when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a `timeout` and don't allow setting a TTL, so I tried
  * TTL checks + manual update script, and not even those set the check to failed when the TTL expires.

  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565

  So I am using a more specific workaround now: a TTL check + manual update script, AND a script (`consul-scripting-helper.py waitUntilService --wait-for-index-change`) run by a service (`glusterReadyForClientMount.service`) that waits until the TTL of a check for the service is observed to be bumped at least once during the lifetime of the script.
  When the script observes a TTL bump, we can be sure that at least one of the Gluster servers has its volume up.
* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used to check whether the volume is actually up.
* Use Consul's DNS feature to automatically pick an available server for the mount.
  dnsmasq is used to forward DNS queries for the *.consul domain to the Consul agent.
  `allow_stale = false` is used to ensure that the DNS answers are not outdated.
* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).
* `consul-scripting-helper.py` received some fixes and extra loops to retry when Consul is down.

This commit also switches to using `services.glusterfs.tlsSettings` as implemented in NixOS/nixpkgs#27340, which revealed a lot of the above issues.
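
For illustration, the two readiness checks described above can be sketched roughly as follows. This is a minimal sketch and not the actual code of `consul-scripting-helper.py`: the names `volume_is_online`, `wait_for_index_change` and the injected `fetch` callable are hypothetical, and `fetch` is assumed to return the current Consul index for the service together with whether its check is passing. The idea it shows is that a "passing" state left over from before a reboot is not trusted; only a passing state seen after the index changed during the lifetime of the call counts.

```python
import re
import time
from typing import Callable, Optional, Tuple

# Same pattern as `grep "^Online.*Y"`, applied line by line.
ONLINE_RE = re.compile(r"^Online.*Y", re.MULTILINE)


def volume_is_online(status_output: str) -> bool:
    """True if any line of the `gluster volume status ... detail`
    output matches ^Online.*Y (at least one brick is online)."""
    return ONLINE_RE.search(status_output) is not None


def wait_for_index_change(
    fetch: Callable[[], Tuple[int, bool]],  # hypothetical: returns (index, passing)
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the check is passing AND the index differs from
    the first index observed by this call; False on timeout."""
    deadline = clock() + timeout_s
    first_index: Optional[int] = None
    while clock() < deadline:
        try:
            index, passing = fetch()
        except OSError:
            # Consul may be unreachable (e.g. mid-reboot); retry.
            sleep(poll_s)
            continue
        if first_index is None:
            # Remember the starting point; a stale "passing" is not enough.
            first_index = index
        elif index != first_index and passing:
            return True
        sleep(poll_s)
    return False
```

On a server, `volume_is_online` would be fed the output of the `gluster` command before bumping the TTL check; on a client, `wait_for_index_change` would gate the mount. How `fetch` talks to Consul is deliberately left out here.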