This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Update config files for g4 #20

Merged: 18 commits, May 12, 2020
Changes from 9 commits
64 changes: 63 additions & 1 deletion tools/jenkins-slave-creation-unix/README.md
@@ -15,4 +15,66 @@
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

This Terraform setup will spawn an instance that is ready to be saved into an AMI to create a Jenkins slave.

# Steps
## Setup Terraform
### Fetch Terraform and unzip the binary

```
wget https://releases.hashicorp.com/terraform/0.12.24/terraform_0.12.24_linux_amd64.zip
sudo apt install unzip
unzip terraform_0.12.24_linux_amd64.zip
```
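The download above is not verified before it is unpacked. A minimal sketch of a checksum check follows, assuming `sha256sum` from GNU coreutils is available. The helper name `verify_sha256` is hypothetical, and the expected digest has to be copied from the `terraform_0.12.24_SHA256SUMS` file published alongside the release archive.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Hypothetical helper: succeeds only if the SHA-256 digest of FILE
# equals EXPECTED_HEX (lowercase hex, as printed by sha256sum).
verify_sha256() {
    local file="$1" expected="$2" actual
    actual="$(sha256sum "$file" | awk '{print $1}')"
    # An empty digest means sha256sum could not read the file.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (the digest is a placeholder, not a real value):
#   verify_sha256 terraform_0.12.24_linux_amd64.zip "<digest-from-SHA256SUMS>" \
#       && unzip terraform_0.12.24_linux_amd64.zip
```

Unpacking only when the check succeeds keeps a corrupted or tampered archive from reaching the `PATH`.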

### Add to path
Move the binary into a directory that is listed in the `PATH` environment variable.
For example:

```
mkdir -p /home/ubuntu/bin
mv terraform /home/ubuntu/bin/terraform
```

### Verify
Check that the directory containing the `terraform` binary is listed in `PATH`:

```
echo $PATH
```

Verify that Terraform is installed correctly:

```
$ terraform --version
Terraform v0.12.24
$ which terraform
/home/ubuntu/bin/terraform
```
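Reading the output of `echo $PATH` by eye is error-prone. The check can be sketched as a small helper, shown here under the assumption of a POSIX-style shell; the name `path_contains` is hypothetical. On a stock Ubuntu account, `~/.profile` typically adds `$HOME/bin` to `PATH` only at login, so a freshly created `/home/ubuntu/bin` may not be visible in the current shell yet.

```shell
# path_contains DIR
# Hypothetical helper: succeeds if DIR is exactly one of the
# colon-separated entries of PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage: extend PATH for the current shell only if the entry is missing.
#   path_contains /home/ubuntu/bin || export PATH="/home/ubuntu/bin:$PATH"
```

Wrapping `PATH` in colons before matching avoids false positives on prefixes such as `/home/ubuntu/bin2`.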

## Python package requirements
Install the `python_terraform` Python package:

```
pip3 install python_terraform
```

## Fill in the redacted information
- `infrastructure.tf` [security groups]
- `infrastructure.tfvars` [`key_name`, `key_path`, `secret_manager_docker_hub_arn`]
- `~/.aws/config` [Isengard account profile]
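A forgotten placeholder is easy to miss. The sketch below looks for leftover values before the script is run. It assumes the placeholders are the literal string `REDACTED`, as in the `.tfvars` file in this change; the helper name `has_placeholders` is hypothetical.

```shell
# has_placeholders FILE...
# Hypothetical helper: prints every line that still contains the literal
# string REDACTED (as file:line:text) and succeeds if any such line exists.
has_placeholders() {
    grep -n -H 'REDACTED' "$@"
}

# Usage: refuse to continue while placeholders remain.
#   if has_placeholders infrastructure.tf infrastructure.tfvars; then
#       echo "Fill in the values listed above first." >&2
#       exit 1
#   fi
```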

## Run the AMI creation script

```
./create_slave.sh
```

- Enter the desired directory

## Create an AMI
- Log in to the AWS Console.
- The instance is created with the name set in `instance_name` in `infrastructure.tfvars`.
- Wait until the instance's state is "Stopped". [Note: Do not stop the instance manually, because doing so can corrupt the AMI. If the instance never reaches the stopped state, AMI generation most likely hit an issue; refer to `/var/log/cloud-init-output.log` for further debugging.]
- Once the instance is stopped, select the instance, then choose Actions -> Image -> Create Image.
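The console steps above can also be sketched with the AWS CLI. This is an illustration under assumptions that are not part of the README: the AWS CLI is installed and configured for the right account and region, and the caller supplies the instance ID and an AMI name. The function name `create_ami_when_stopped` and the `AWS_CLI` override variable are hypothetical.

```shell
# Sketch only. AWS_CLI can be overridden (for example AWS_CLI=echo) to
# print the commands instead of running them.
AWS_CLI="${AWS_CLI:-aws}"

# create_ami_when_stopped INSTANCE_ID AMI_NAME
create_ami_when_stopped() {
    local instance_id="$1" ami_name="$2"
    # Poll until the instance has stopped itself; never stop it manually.
    "$AWS_CLI" ec2 wait instance-stopped --instance-ids "$instance_id" || return 1
    # Register an AMI from the stopped instance.
    "$AWS_CLI" ec2 create-image --instance-id "$instance_id" --name "$ami_name"
}

# Usage (both arguments are placeholders):
#   create_ami_when_stopped "<instance-id>" "<ami-name>"
```

The waiter gives up after a bounded number of polls, so a long-running install may need the call to be retried. A waiter that keeps failing is also a cue to inspect `/var/log/cloud-init-output.log`.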

This file was deleted.

This file was deleted.

This file was deleted.

134 changes: 0 additions & 134 deletions tools/jenkins-slave-creation-unix/conf-ubuntu-gpu-p3/install.sh

This file was deleted.

@@ -17,13 +17,14 @@

 key_name = "REDACTED"
 key_path = "~/.ssh/REDACTED"
-instance_type = "g3.8xlarge"
+instance_type = "g4dn.4xlarge"

 s3_config_bucket = "mxnet-ci-slave-dev"
-s3_config_filename = "ubuntu-gpu-g3-config.tar.bz2"
-slave_install_script = "conf-ubuntu-gpu-g3/install.sh"
-shell_variables_file = "conf-ubuntu-gpu-g3/shell-variables.sh"
-ami = "ami-bd8f33c5" # ftp://64.50.236.216/pub/ubuntu-cloud-images/query/xenial/server/released.txt
-instance_name = "Slave-base_Ubuntu-GPU-G3"
+s3_config_filename = "ubuntu-gpu-config.tar.bz2"
+slave_install_script = "conf-ubuntu-gpu/install.sh"
+shell_variables_file = "conf-ubuntu-gpu/shell-variables.sh"
+# Base AMI, defines the OS of the slave instance [here: Ubuntu18.04 base image]
+ami = "ami-0d1cd67c26f5fca19" # Ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200112
+instance_name = "Slave-base_Ubuntu-GPU"
 aws_region = "us-west-2"
 secret_manager_docker_hub_arn = "arn:aws:secretsmanager:us-west-2:REDACTED:secret:REDACTED"
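Values such as `instance_name` are needed again later, for example when locating the instance in the console. A minimal sketch for reading one value out of the variables file follows. It assumes simple `key = "value"` lines like the ones in this file and is not a full HCL parser; the helper name `tfvar_get` is hypothetical.

```shell
# tfvar_get FILE KEY
# Hypothetical helper: prints the double-quoted string assigned to KEY.
# Only handles single-line `key = "value"` assignments; a trailing
# comment after the closing quote is ignored.
tfvar_get() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$1" | head -n 1
}

# Usage:
#   tfvar_get infrastructure.tfvars instance_name
```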
@@ -57,7 +57,15 @@ sudo pip3 install boto3 python-jenkins joblib docker
 echo "Installed htop, java, git and python"

 #Install nvidia drivers
-sudo apt-get -y install nvidia-418
+#Chose the latest nvidia driver supported on Tesla driver for Ubuntu18.04
+#Refer : https://www.nvidia.com/Download/driverResults.aspx/158191/en-us
+wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
+sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
+wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
+sudo dpkg -i cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb
+sudo apt-key add /var/cuda-repo-10-2-local-10.2.89-440.33.01/7fa2af80.pub
+sudo apt-get update
+sudo apt-get -y install cuda-drivers

 # TODO: - Disabled nvidia updates @ /etc/apt/apt.conf.d/50unattended-upgrades
 #Unattended-Upgrade::Package-Blacklist {
@@ -79,7 +87,12 @@ sudo apt-get install -y docker-ce
 sudo usermod -aG docker jenkins_slave
 sudo systemctl enable docker #Enable docker to start on startup
 sudo service docker restart
-echo "Installed docker engine"
+# Get latest docker-compose; Ubuntu 18.04 has latest docker in bionic-updates, but not docker-compose and rather ships v1.17 from 2017
+# See https://github.com/docker/compose/releases for latest release
+# /usr/local/bin is not on the PATH in Jenkins, thus place binary in /usr/bin
+sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/bin/docker-compose
+sudo chmod +x /usr/bin/docker-compose
+echo "Installed docker engine and docker-compose"

 # Add nvidia-docker and nvidia-docker-plugin
 curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
@@ -89,9 +102,23 @@ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.li
 sudo tee /etc/apt/sources.list.d/nvidia-docker.list
 sudo apt-get update

-# Install nvidia-docker2 and reload the Docker daemon configuration
-sudo apt-get install -y nvidia-docker2
-sudo pkill -SIGHUP dockerd
+# Install nvidia docker related packages and reload the Docker daemon configuration
+# Install nvidia-container toolkit and reload the Docker daemon configuration
+# Refer Nvidia Docker : https://github.com/NVIDIA/nvidia-docker
+sudo apt-get install -y nvidia-container-toolkit
+sudo systemctl restart docker
+
+# Install & add nvidia container runtime to the Docker daemon
+# Refer https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup
+sudo apt-get install nvidia-container-runtime
+sudo mkdir -p /etc/systemd/system/docker.service.d
+sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
+[Service]
+ExecStart=
+ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
+EOF
+sudo systemctl daemon-reload
+sudo systemctl restart docker

 # Download additional scripts
 sudo apt-get -y install awscli
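The systemd override in this hunk registers an `nvidia` runtime with the Docker daemon, but nothing confirms that the registration took effect after the restart. A possible post-install check is sketched below; it is not part of the merged script. It assumes the `docker` CLI is on the `PATH` and that the caller may talk to the daemon. The function name `has_nvidia_runtime` and the `DOCKER` override variable are hypothetical.

```shell
# DOCKER can be overridden, for example to point at a stub in tests.
DOCKER="${DOCKER:-docker}"

# has_nvidia_runtime
# Hypothetical helper: succeeds if `docker info` reports a runtime named
# "nvidia" among the runtimes configured on the daemon.
has_nvidia_runtime() {
    "$DOCKER" info --format '{{json .Runtimes}}' | grep -q '"nvidia"'
}

# Usage: log a clear failure instead of producing a broken AMI silently.
#   has_nvidia_runtime || echo "nvidia runtime is not registered with dockerd" >&2
```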
@@ -19,4 +19,4 @@


 export S3_CONFIG_BUCKET="mxnet-ci-slave-dev"
-export S3_CONFIG_FILE="ubuntu-gpu-p3-config.tar.bz2"
+export S3_CONFIG_FILE="ubuntu-gpu-config.tar.bz2"