This repository has been archived by the owner on Mar 30, 2023. It is now read-only.

Merge pull request #68 from RSE-Cambridge/update-lustre
Fix up dac-ansible
JohnGarbutt authored Jul 10, 2019
2 parents ee3553c + 497cbfc commit 01671ed
Showing 4 changed files with 95 additions and 6 deletions.
12 changes: 12 additions & 0 deletions .circleci/config.yml
@@ -21,3 +21,15 @@ jobs:

- store_artifacts:
path: ~/data-acc/bin

workflows:
  version: 2
  regular-build:
    jobs:
      - build
  tagged-build:
    jobs:
      - build:
          filters:
            tags:
              only: /^v.*/
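
To make the tag filter above concrete, here is a minimal sketch of which tag names the pattern `/^v.*/` accepts. The helper name `matches_tag_filter` is hypothetical and is not part of CircleCI; it uses a shell glob (`v*`) as a stand-in for the regular expression, which should be equivalent for this particular pattern.

```shell
# Sketch only: approximates the CircleCI tag filter `only: /^v.*/` locally.
# matches_tag_filter is a hypothetical helper invented for illustration.
matches_tag_filter() {
  case "$1" in
    v*) return 0 ;;  # any tag that starts with "v", e.g. v1.1 or v0.18
    *)  return 1 ;;  # everything else, e.g. release-1.0
  esac
}

# Usage:
#   matches_tag_filter "v1.1" && echo "tagged-build would run"
#   matches_tag_filter "release-1.0" || echo "tag ignored by the filter"
```

Under this reading, release tags such as `v1.1` (the version referenced in `defaults/main.yml`) would trigger `tagged-build`, while tags without the `v` prefix would not.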
77 changes: 77 additions & 0 deletions dac-ansible/README.md
@@ -8,6 +8,13 @@ To run this set of playbooks, please execute:
    ./create-servers.py > hosts
    ansible-playbook master.yml -i hosts

Note that the playbook above pulls the Docker image johngarbutt/data-acc,
which can be built and pushed with something like this:

    cd ../docker-slurm
    ./build.sh
    docker-compose push

## Install notes

You may find the following useful for setting up an environment to run the ansible-playbook command above:
@@ -17,3 +24,73 @@ You may find this useful to run the above ansible-playbook command:
    pip install -U pip
    pip install -U ansible openstacksdk
    ansible-galaxy install -r requirements.yml

## Debugging Guide

Once the Ansible run has finished, you can log in and try a Slurm test:

    ssh centos@<ip-of-slurm-master>
    docker exec -it slurmctld bash
    scontrol show burstbuffer
    cd /usr/local/bin/data-acc/tools/
    . slurm-test.sh

### dac-slurm-master

The Slurm master calls dacctl via the DataWarp burst buffer
plugin. dacctl itself only really talks to etcd.

For slurmctld you can see the logs here:

    ssh centos@<ip-of-slurm-master>
    docker logs slurmctld

You can see the dacctl logs here:

    ssh centos@<ip-of-slurm-master>
    docker exec -it slurmctld bash
    less /var/log/dacctl.log

When a buffer needs to be torn down after you have fixed whatever
blocked previous attempts (such as a bad sudoers file),
you can try:

    ssh centos@<ip-of-slurm-master>
    docker exec -it slurmctld bash
    /usr/local/bin/dacctl teardown --token <job-id>

Note the above tends to leave client mounts behind, which need to be cleared
manually via "umount -l <directory>" on slurm-cpu[1-2].
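
The manual cleanup described above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes the leftover client mounts are Lustre filesystems mounted under `/dac/` (as the sudoers rule in `master.yml` suggests), and the function names and the `DAC_UMOUNT_APPLY` variable are invented for illustration. Since this commit changes the dac user's sudoers entry to allow `umount /dac/*` without `-l`, the lazy unmount presumably has to be run by a user with broader sudo rights.

```shell
# list_dac_lustre_mounts: read /proc/mounts-style lines on stdin and print
# the mount point of every lustre filesystem mounted under /dac/.
list_dac_lustre_mounts() {
  awk '$3 == "lustre" && index($2, "/dac/") == 1 { print $2 }'
}

# cleanup_dac_mounts: read mount points on stdin, one per line.
# Dry run by default: it prints the commands instead of running them.
# Set DAC_UMOUNT_APPLY=1 (a variable invented for this sketch) to unmount.
cleanup_dac_mounts() {
  local mnt
  while read -r mnt; do
    if [ "${DAC_UMOUNT_APPLY:-0}" = "1" ]; then
      sudo umount -l "$mnt" </dev/null
    else
      echo "would run: sudo umount -l $mnt"
    fi
  done
}

# Usage on slurm-cpu1 or slurm-cpu2:
#   list_dac_lustre_mounts < /proc/mounts | cleanup_dac_mounts
```

Defaulting to a dry run is a deliberate choice here: a lazy unmount detaches the filesystem even while it is busy, so it seems safer to review the list before applying it.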

### dac[1-3]

The dacd processes watch etcd, waiting for commands from dacctl.
The dacd that holds the 0th brick runs ansible over all the dacd
nodes to create the filesystem, then runs ssh to each of the
compute nodes to mount the filesystem.

On the dacd nodes, you can find out a lot from journalctl:

    ssh centos@<ip-of-dacd-node>
    journalctl -u dacd

You can also inspect the current state of data-acc by looking in etcd:

    ssh centos@<ip-of-dacd-node>
    sudo su dac /usr/local/bin/data-acc-v0.6/tools/etcd-ls.sh

You can check ssh access from dac by doing:

    ssh centos@<ip-of-dacd-node>
    ssh dac@dac1 date
    ssh dac@dac2 date
    ssh dac@dac3 date
    ssh dac@slurm-cpu1 date
    ssh dac@slurm-cpu2 date
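
The same check can be looped over all hosts so that a single failing node stands out. This is a rough sketch, not part of data-acc: the function names and the `SSH_CMD` override are hypothetical, and the host list is simply copied from the commands above.

```shell
# Hosts taken from the commands above: dac[1-3] and slurm-cpu[1-2].
dac_check_hosts() {
  printf '%s\n' dac1 dac2 dac3 slurm-cpu1 slurm-cpu2
}

# check_dac_ssh: run `ssh dac@<host> date` for each host and report the result.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# SSH_CMD is a hypothetical override, useful for trying the loop without ssh.
check_dac_ssh() {
  local host failed=0
  for host in $(dac_check_hosts); do
    if "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "dac@${host}" date >/dev/null 2>&1; then
      echo "ok: ${host}"
    else
      echo "FAILED: ${host}"
      failed=1
    fi
  done
  return "$failed"
}

# Usage, after logging in to a dacd node:
#   check_dac_ssh
```

The function returns non-zero if any host fails, so it can also be used as a quick precondition in other scripts.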

### slurm-cpu[1-2]

These nodes mostly just wait for ssh from dacd, which mounts Lustre.

"mount" shows the current state of things, as does looking at
dmesg for Lustre messages.
8 changes: 4 additions & 4 deletions dac-ansible/master.yml
@@ -112,7 +112,7 @@
- hosts: dac_workers:slurm_workers
become: True
vars:
-lustre_release: "2.12.0"
+lustre_release: "2.12.2"
tasks:
- name: enable lustre server repo
yum_repository:
@@ -139,7 +139,7 @@
- hosts: dac_workers:slurm_workers
become: True
vars:
-lustre_release: "2.12.0"
+lustre_release: "2.12.2"
tasks:
- name: Install Lustre Server
yum:
@@ -153,7 +153,7 @@
- hosts: dac_workers:slurm_workers
become: True
vars:
-lustre_release: "2.12.0"
+lustre_release: "2.12.2"
tasks:
- name: Install Lustre Client
yum:
@@ -245,7 +245,7 @@
- name: Ensure passwordless sudo for dac user
lineinfile:
path: /etc/sudoers.d/80-dac
-line: "dac ALL=(ALL) NOPASSWD: /usr/bin/mkdir -p /dac/*, /usr/bin/chmod 770 /dac/*, /usr/bin/chmod 0600 /dac/*, /usr/bin/chown * /dac/*, /usr/bin/mount -t lustre * /dac/*, /usr/bin/umount -l /dac/*, /usr/sbin/losetup /dev/loop* /dac/*, /usr/sbin/losetup -d /dev/loop*, /usr/sbin/mkswap /dev/loop*, /usr/sbin/swapon /dev/loop*, /usr/sbin/swapoff /dev/loop*, /usr/bin/ln -s /dac/* /dac/*, /usr/bin/dd if=/dev/zero of=/dac/*, /usr/bin/rm -rf /dac/*, /bin/grep /dac/* /etc/mtab"
+line: "dac ALL=(ALL) NOPASSWD: /usr/bin/mkdir -p /dac/*, /usr/bin/chmod 770 /dac/*, /usr/bin/chmod 0600 /dac/*, /usr/bin/chown * /dac/*, /usr/bin/mount -t lustre * /dac/*, /usr/bin/umount /dac/*, /usr/sbin/losetup /dev/loop* /dac/*, /usr/sbin/losetup -d /dev/loop*, /usr/sbin/mkswap /dev/loop*, /usr/sbin/swapon /dev/loop*, /usr/sbin/swapoff /dev/loop*, /usr/bin/ln -s /dac/* /dac/*, /usr/bin/dd if=/dev/zero of=/dac/*, /usr/bin/rm -rf /dac/*, /bin/grep /dac/* /etc/mtab"
regexp: "^dac.*$"
create: yes
state: present
4 changes: 2 additions & 2 deletions dac-ansible/roles/data-acc/defaults/main.yml
@@ -11,8 +11,8 @@ data_acc_launch: True
data_acc_name: 'data-acc-{{data_acc_version}}'
data_acc_tgz: '{{data_acc_name}}.tgz'
#data_acc_tgz_url: '{{data_acc_mirror}}/{{data_acc_version}}/{{data_acc_tgz}}'
-data_acc_tgz_url: 'https://github.com/RSE-Cambridge/data-acc/releases/download/v0.18/data-acc-v0.18.tgz'
-data_acc_checksum: 'sha256:beed4ab5ee72f68b244c4e5fd3d32584c7c5d0efcf165c7077ba7d468c2b8efb'
+data_acc_tgz_url: 'https://github.com/RSE-Cambridge/data-acc/releases/download/v1.1/data-acc-v1.1.tgz'
+data_acc_checksum: 'sha256:a88e66046f5d662582d85c6ce5eba56e7eec2e2b1d690206e7181f6e862f39f7'

data_acc_user: dac
data_acc_group: dac
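
When bumping `data_acc_tgz_url` and `data_acc_checksum` together, as this hunk does, it can help to confirm by hand that the checksum really belongs to the tarball. The sketch below assumes GNU `sha256sum` is available; `verify_sha256` is a hypothetical helper, not something shipped with data-acc or Ansible.

```shell
# verify_sha256 FILE EXPECTED
# EXPECTED may carry the "sha256:" prefix used by data_acc_checksum.
# Hypothetical helper; assumes GNU coreutils sha256sum is on PATH.
verify_sha256() {
  local file="$1" expected="${2#sha256:}" actual
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  if [ "$actual" = "$expected" ]; then
    echo "checksum ok: $file"
  else
    echo "checksum MISMATCH: $file (got $actual)" >&2
    return 1
  fi
}

# Usage, with the values from this hunk:
#   curl -fLO https://github.com/RSE-Cambridge/data-acc/releases/download/v1.1/data-acc-v1.1.tgz
#   verify_sha256 data-acc-v1.1.tgz sha256:a88e66046f5d662582d85c6ce5eba56e7eec2e2b1d690206e7181f6e862f39f7
```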
