Improve robustness when rebooting.
Until now, after a full reboot of all machines, some of them would
sometimes come up with failed systemd units.

Key changes:

* A mount-only machine is added to test that this use case works
  (sketched below). This is what exposed all of the problems listed here.
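
  Schematically, such a node is just another entry in the NixOps network
  that imports only the client-side configuration; the machine name and
  module path below are illustrative assumptions, not the repo's actual ones:

  ```
  {
    # A machine that only mounts the Gluster volume; it runs no brick itself.
    glusterClient = { config, pkgs, ... }: {
      imports = [ ./gluster-cluster/client.nix ];  # hypothetical module path
    };
  }
  ```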

* Fix an SSH hang by using a .mount unit instead of the fstab converter
  (sketched below).
  This apparently works around NixOS/nixpkgs#30348 for me;
  I have no idea why the fstab converter has this problem.
  The nasty
    pam_systemd(sshd:session): Failed to create session: Connection timed out
  error would delay SSH logins by 25 seconds, which also made reboots slower
  (because the nixops keys upload uses SSH).
  It would also show units like `session-1.scope` as failed in systemctl.
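
  A minimal sketch of such a declarative mount unit in NixOS; the volume
  name, server address, and mount point are placeholder assumptions:

  ```
  {
    # An explicit .mount unit, bypassing systemd's fstab converter entirely.
    systemd.mounts = [
      {
        what = "glusterServers.service.consul:/dist";  # hypothetical volume
        where = "/glusterfs";
        type = "glusterfs";
        wantedBy = [ "multi-user.target" ];
        requires = [ "network-online.target" ];
        after = [ "network-online.target" ];
      }
    ];
  }
  ```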

* More robustly track (via Consul) whether the Gluster volume is already
  mountable from the client (that is, up and running on the servers).
  Getting this right took several attempts; to implement it, I tried:
  * manual sessions, but those have a 10-second minimum TTL which gets
    auto-extended even longer when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a
    `timeout` and don't allow setting a TTL, so I tried
  * TTL checks + a manual update script, but not even those set the check
    to failed when the TTL expires.
  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565
  So I am using a more specific workaround now:
  a TTL check + manual update script, AND a script
  (`consul-scripting-helper.py waitUntilService --wait-for-index-change`)
  run by a service (`glusterReadyForClientMount.service`)
  that waits until the check's TTL is observed to be bumped at least
  once during the lifetime of the script.
  When the script observes a TTL bump, we can be sure that at least
  one of the gluster servers has its volume up (sketched below).
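
  A hedged sketch of the two sides of this workaround; the check id, TTL
  value, and the helper's `--service` argument are assumptions (only
  `waitUntilService --wait-for-index-change` and the service name come from
  the actual setup):

  ```
  { pkgs, ... }:
  {
    # Server side: a TTL check that the update script must keep bumping;
    # Consul marks it critical if the TTL lapses.
    services.consul.extraConfig.checks = [
      { id = "gluster-volume-up"; name = "gluster-volume-up"; ttl = "20s"; }
    ];

    # Client side: block until the helper observes the check's index change,
    # i.e. until at least one server has bumped the TTL.
    systemd.services.glusterReadyForClientMount = {
      wantedBy = [ "multi-user.target" ];
      serviceConfig = { Type = "oneshot"; RemainAfterExit = true; };
      script = ''
        ${pkgs.python3}/bin/python ${./consul-scripting-helper.py} \
          waitUntilService --service gluster-volume-up --wait-for-index-change
      '';
    };
  }
  ```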

* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used
  to check whether the volume is actually up (see the sketch below).
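
  For illustration, the update-script side could be a timer-driven unit that
  greps the volume status and bumps the TTL check through Consul's agent
  HTTP API; the check id, volume name, and interval are assumptions:

  ```
  { pkgs, ... }:
  {
    systemd.services.glusterVolumeCheck = {
      path = [ pkgs.glusterfs pkgs.gnugrep pkgs.curl ];
      script = ''
        # Pass the TTL check only if a brick reports itself as Online.
        if gluster volume status dist detail | grep "^Online.*Y" >/dev/null; then
          curl -X PUT http://127.0.0.1:8500/v1/agent/check/pass/gluster-volume-up
        fi
      '';
    };
    systemd.timers.glusterVolumeCheck = {
      wantedBy = [ "timers.target" ];
      timerConfig = { OnBootSec = "10s"; OnUnitActiveSec = "10s"; };
    };
  }
  ```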

* Use consul's DNS feature to automatically pick an available server
  for the mount (sketched below).
  dnsmasq forwards DNS queries for the `*.consul` domain
  to the local consul agent.
  `allow_stale = false` ensures that DNS answers are never served
  from stale data.
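
  Roughly, the DNS wiring looks like this in NixOS (the Consul agent's
  default DNS port 8600 is assumed):

  ```
  {
    # Forward *.consul lookups to the local Consul agent's DNS interface.
    services.dnsmasq.enable = true;
    services.dnsmasq.servers = [ "/consul/127.0.0.1#8600" ];

    # Don't answer DNS queries from stale (non-leader) data.
    services.consul.extraConfig.dns_config.allow_stale = false;
  }
  ```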

* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings
  (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).
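
  One declarative way to provision that file is a oneshot unit that
  generates it if missing; a sketch, where the 2048-bit size and the
  `glusterd.service` ordering are assumptions:

  ```
  { pkgs, ... }:
  {
    systemd.services.dhparam-pem = {
      wantedBy = [ "multi-user.target" ];
      before = [ "glusterd.service" ];
      serviceConfig = { Type = "oneshot"; RemainAfterExit = true; };
      path = [ pkgs.openssl ];
      script = ''
        # Generate the DH parameters Gluster looks for, if not present yet.
        if [ ! -f /etc/ssl/dhparam.pem ]; then
          openssl dhparam -out /etc/ssl/dhparam.pem 2048
        fi
      '';
    };
  }
  ```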

* `consul-scripting-helper.py` received some fixes and extra retry loops
  for when Consul is down.

This commit also switches to using `services.glusterfs.tlsSettings`
as implemented in NixOS/nixpkgs#27340
which revealed a lot of the above issues.
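
If I recall the options from that PR correctly, the TLS wiring is along
these lines; the certificate paths are placeholders, and the exact option
names should be checked against the PR:

```
{
  services.glusterfs.tlsSettings = {
    caCert = "/etc/ssl/glusterfs/ca.pem";
    tlsPem = "/etc/ssl/glusterfs/server.pem";
    tlsKeyPath = "/etc/ssl/glusterfs/server.key";
  };
}
```
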
nh2 committed Oct 22, 2017
1 parent 91b6be3 commit f93e289
Showing 8 changed files with 697 additions and 125 deletions.
README.md: 3 changes (2 additions & 1 deletion)
@@ -5,6 +5,7 @@ This demonstrates an advanced [nixops](https://nixos.org/nixops/) deployment of
* the [GlusterFS](https://www.gluster.org/) distributed file system
* a replicated setup across 3 machines in AWS Frankfurt
* read-only [geo-replication](https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Geo%20Replication/) to another gluster cluster (could be on the other side of the world)
* mounting the file system from a client-only node
* use of gluster's [SSL](https://gluster.readthedocs.io/en/latest/Administrator%20Guide/SSL/) support for encryption and server/client authentication
* use of [consul](https://www.consul.io/) to orchestrate volume-initialisation across machines on first boot
* the whole thing running over the [tinc](http://tinc-vpn.org/) VPN for security
@@ -52,7 +53,7 @@ Run

```
nixops create -d gluster-test-deployment '<example-gluster-cluster.nix>'
-env NIX_PATH=.:nixpkgs=https://github.com/nh2/nixpkgs/archive/84ecf175.tar.gz nixops deploy -d gluster-test
+env NIX_PATH=.:nixpkgs=https://github.com/nh2/nixpkgs/archive/84ecf17.tar.gz nixops deploy -d gluster-test-deployment
```

This should complete without errors and you should have your gluster cluster ready.
