Commit bec0ade

Update setup documentation and Docker images
1 parent 1fbe518 commit bec0ade

8 files changed: +231 -240 lines changed

.devcontainer/Dockerfile

Lines changed: 12 additions & 7 deletions
@@ -38,14 +38,19 @@ RUN ln -sfn /usr/bin/python${PYTHON_VERSION} /usr/bin/python3 & \
3838
RUN python3 -m venv $POETRY_VENV \
3939
&& $POETRY_VENV/bin/pip install -U pip setuptools \
4040
&& $POETRY_VENV/bin/pip install poetry==${POETRY_VERSION}
41-
# Add `poetry` to PATH
41+
# Add `poetry` to PATH and configure
4242
ENV PATH="${PATH}:${POETRY_VENV}/bin"
43-
# Install AWS CLI
44-
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
45-
unzip awscliv2.zip && \
46-
./aws/install && \
47-
rm awscliv2.zip
48-
RUN rm -rf /var/lib/apt/lists/*
4943
RUN poetry config virtualenvs.create true && \
5044
poetry config virtualenvs.in-project true
45+
# Clean up
46+
RUN rm -rf /var/lib/apt/lists/*
47+
# Create caches
48+
RUN mkdir -p /root/.cache/silnlp/experiments
49+
RUN mkdir /root/.cache/silnlp/projects
50+
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
51+
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
52+
# Set environment variables
53+
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
54+
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
55+
ENV AWS_REGION="us-east-1"
5156
CMD ["bash"]

.devcontainer/devcontainer.json

Lines changed: 4 additions & 2 deletions
@@ -17,7 +17,6 @@
1717
"/home/clearml/.clearml/hf-cache:/root/.cache/huggingface"
1818
],
1919
"containerEnv": {
20-
"AWS_REGION": "${localEnv:AWS_REGION}",
2120
"AWS_ACCESS_KEY_ID": "${localEnv:AWS_ACCESS_KEY_ID}",
2221
"AWS_SECRET_ACCESS_KEY": "${localEnv:AWS_SECRET_ACCESS_KEY}",
2322
"CLEARML_API_ACCESS_KEY": "${localEnv:CLEARML_API_ACCESS_KEY}",
@@ -44,7 +43,10 @@
4443
},
4544
"editor.formatOnSave": true,
4645
"editor.formatOnType": true,
47-
"isort.args":["--profile", "black"]
46+
"isort.args": [
47+
"--profile",
48+
"black"
49+
]
4850
},
4951
// Add the IDs of extensions you want installed when the container is created.
5052
"extensions": [

Dockerfile

Lines changed: 7 additions & 2 deletions
@@ -105,11 +105,16 @@ RUN mv meteor-1.5/meteor-1.5.jar /usr/local/bin
105105
RUN rm -rf meteor-1.5
106106
ENV METEOR_PATH=/usr/local/bin
107107

108+
# Create caches
109+
RUN mkdir -p .cache/silnlp/experiments
110+
RUN mkdir .cache/silnlp/projects
111+
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp/experiments
112+
ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
113+
108114
# Other environment variables
109115
ENV SIL_NLP_DATA_PATH=/aqua-ml-data
110-
RUN mkdir -p .cache/silnlp
111-
ENV SIL_NLP_CACHE_EXPERIMENT_DIR=/root/.cache/silnlp
112116
ENV CLEARML_API_HOST="https://api.sil.hosted.allegro.ai"
117+
ENV AWS_REGION="us-east-1"
113118

114119
# Clone silnlp and make it the starting directory
115120
RUN git clone https://github.com/sillsdev/silnlp.git

README.md

Lines changed: 44 additions & 178 deletions
Large diffs are not rendered by default.

clear_ml_linux_setup.md renamed to clear_ml_setup.md

Lines changed: 2 additions & 5 deletions
@@ -1,10 +1,7 @@
1-
# Instructions for setting up Clear-ML on Linux.
2-
3-
These were tested on Pop!_OS.
4-
See [Clear-ML Windows setup](clear_ml_windows_setup.md) for instructions to set up Clear-ML on Windows.
1+
# Instructions for setting up Clear-ML.
52

63
## Install the clearml python package.
7-
Open a terminal and use pip to install Clear-ML.
4+
Open a terminal (or Command Prompt on Windows) and use pip to install Clear-ML.
85
`pip install clearml`
96

107
## Add your AWS storage vault credentials (If using AWS S3).

clear_ml_windows_setup.md

Lines changed: 0 additions & 46 deletions
This file was deleted.

manual_setup.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
# Manual Setup
2+
3+
## SILNLP Prerequisites
4+
These are the main requirements for running the SILNLP code on a local machine. The SILNLP repo is hosted on GitHub, is written mainly in Python, and calls SIL.Machine.Tool. 'Machine', as we tend to call it, is a .NET application that has many functions for manipulating USFM data. Most of the language data we have for low-resource languages is in USFM format. Since Machine is a .NET application, it depends on the __.NET Core SDK__, which works on Windows and Linux. Since many Python packages with complex versioning requirements are needed, we use a Python package called Poetry to manage them. Here is a rough hierarchy of SILNLP's major dependencies.
5+
6+
| Requirement | Reason |
7+
| --------------------- | ----------------------------------------------------------------- |
8+
| Git                   | to get the repo from [GitHub](https://github.com/sillsdev/silnlp) |
9+
| Python | to run the silnlp code |
10+
| Poetry | to manage all the Python packages and versions |
11+
| SIL.Machine.Tool | to support many functions for data manipulation |
12+
| .NET Core SDK         | Required by SIL.Machine.Tool                                      |
13+
| NVIDIA GPU | Required to run on a local machine |
14+
| NVIDIA drivers        | Required for the GPU                                              |
15+
| CUDA Toolkit          | Required for machine learning with the GPU                        |
16+
| Environment variables | To tell SILNLP where to find the data, etc. |
17+
18+
## Setup
19+
20+
The SILNLP code can be run on either Windows or Linux operating systems. If using an Ubuntu distribution, the only compatible version is 20.04.
21+
22+
__Download and install__ the following before creating any projects or starting any code, preferably in this order to avoid most warnings:
23+
24+
1. If using a local GPU: [NVIDIA driver](https://www.nvidia.com/download/index.aspx)
25+
* On Ubuntu, the driver can alternatively be installed through the GUI by opening Software & Updates, navigating to Additional Drivers in the top menu, and selecting the newest NVIDIA driver with the labels proprietary and tested.
26+
* After installing the driver, reboot your system.
27+
2. [Git](https://git-scm.com/downloads)
28+
3. [Python 3.8](https://www.python.org/downloads/) (latest patch release, e.g. 3.8.19)
29+
* Alternatively, install Python using [miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html) if you plan to use more than one version of Python. If you follow this method, activate your conda environment before installing Poetry.
30+
4. [Poetry](https://python-poetry.org/docs/#installation)
31+
* Note that whether the command should call python or python3 depends on which is required on your machine.
32+
* It may (or may not) be possible to run the curl command within a VS Code terminal. If that causes permission errors, close VS Code and try it in an elevated CMD prompt.
33+
34+
Windows:
35+
At an administrator CMD prompt or a terminal within VS Code run:
36+
```
37+
curl -sSL https://install.python-poetry.org | python - --version 1.7.1
38+
```
39+
In PowerShell, run:
40+
```
41+
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python - --version 1.7.1
42+
```
43+
44+
Linux:
45+
In a terminal, run:
46+
```
47+
curl -sSL https://install.python-poetry.org | python3 - --version 1.7.1
48+
```
49+
Add the following line to your .bashrc file in your home directory:
50+
```
51+
export PATH="$HOME/.local/bin:$PATH"
52+
```
53+
5. .NET Core SDK
54+
* The necessary versions are 7.0 and 3.1. If your machine is only able to install version 7.0, you can set the DOTNET_ROLL_FORWARD environment variable to "LatestMajor", which will allow you to run anything that depends on dotnet 3.1.
55+
* Note - the .NET SDK is needed for [SIL.Machine.Tool](https://github.com/sillsdev/machine). Many of the scripts in this repo require this .NET package. The .NET package will be installed and updated when silnlp is initialized in `__init__.py`.
56+
* Windows: [.NET Core SDK](https://dotnet.microsoft.com/download)
57+
* Linux: Installation instructions can be found [here](https://learn.microsoft.com/en-us/dotnet/core/install/linux-ubuntu-2004)
58+
6. C++ Redistributable
59+
* Note - this may already be installed. If it is not installed you may get cryptic errors such as "System.DllNotFoundException: Unable to load DLL 'thot' or one of its dependencies"
60+
* Windows: Download from https://support.microsoft.com/en-us/topic/the-latest-supported-visual-c-downloads-2647da03-1eea-4433-9aff-95f26a218cc0 and install
61+
* Linux: Instead of installing the redistributable, run the following commands:
62+
```
63+
sudo apt-get update
64+
sudo apt-get install build-essential gdb
65+
```
66+
67+
### Visual Studio Code setup
68+
69+
1. Install Visual Studio Code
70+
2. Install Python extension for VS Code
71+
3. Open the silnlp folder in VS Code
72+
4. In a CMD window, type `poetry install` to create the virtual environment for silnlp
73+
* If using conda, activate your conda environment first before `poetry install`. Poetry will then install all the dependencies into the conda environment.
74+
5. Choose the newly created virtual environment as the "Python Interpreter" in the command palette (Ctrl+Shift+P)
75+
* If using conda, choose the conda environment as the interpreter
76+
6. Open the command palette and select "Preferences: Open User Settings (JSON)". In the `settings.json` file, add the following options:
77+
``` json
78+
"python.formatting.provider": "black",
79+
"python.linting.pylintEnabled": true,
80+
"editor.formatOnSave": true,
81+
```
82+
83+
### S3 bucket setup
84+
85+
See [S3 bucket setup](s3_bucket_setup.md).
86+
87+
### ClearML setup
88+
89+
See [ClearML setup](clear_ml_setup.md).
90+
91+
### Additional Environment Variables
92+
Set the following environment variables with your respective credentials: CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.
93+
* Windows users: see [here](https://github.com/sillsdev/silnlp/wiki/Install-silnlp-on-Windows-10#permanently-set-environment-variables) for instructions on setting environment variables permanently
94+
* Linux users: To set environment variables permanently, add each variable as a new line to the `.bashrc` file in your home directory with the format
95+
```
96+
export VAR="VAL"
97+
```
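
As a quick check that the four variables listed above are visible to Python, something like the following sketch could be run in a new terminal. The helper name `missing_env_vars` is hypothetical and not part of SILNLP; only the variable names come from this section.

```python
import os

# Variable names are taken from the section above.
REQUIRED_VARS = (
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; SILNLP does not ship it.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If a variable is reported missing right after editing `.bashrc`, opening a new terminal (or running `source ~/.bashrc`) is usually enough for it to be picked up.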
98+
99+
### Setting Up and Running Experiments
100+
101+
See the [wiki](https://github.com/sillsdev/silnlp/wiki) for information on setting up and running experiments. The most important pages for getting started are the ones on [file structure](https://github.com/sillsdev/silnlp/wiki/Folder-structure-and-file-naming-conventions), [model configuration](https://github.com/sillsdev/silnlp/wiki/Configure-a-model), and [running experiments](https://github.com/sillsdev/silnlp/wiki/NMT:-Usage). A lot of the instructions are specific to NMT, but are still helpful starting points for doing other things like [alignment](https://github.com/sillsdev/silnlp/wiki/Alignment:-Usage).
102+
103+
See [this](https://github.com/sillsdev/silnlp/wiki/Using-the-Python-Debugger) page for information on using the VS Code debugger.
104+
105+
If you need to use a tool that is supported by SILNLP but is not installable as a Python library (which is probably the case if you get an error like "RuntimeError: eflomal is not installed."), follow the appropriate instructions [here](https://github.com/sillsdev/silnlp/wiki/Installing-External-Libraries).

s3_bucket_setup.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
1+
# S3 bucket setup
2+
3+
We use Amazon S3 storage for storing our experiment data. Here is some workspace setup to enable a decent workflow.
4+
5+
### Install and configure AWS S3 storage
6+
The following will allow the boto3 and S3Path libraries in Python to talk to the S3 bucket correctly.
7+
* Install the aws-cli from: https://aws.amazon.com/cli/
8+
* In a terminal, run `aws configure` and enter your AWS access_key_id, secret_access_key, and region (we use region = us-east-1).
9+
* The `aws configure` command creates a folder named '.aws' in your home directory. It should contain two plain text files named 'config' and 'credentials'. The config file should contain the region, and the credentials file should contain your access_key_id and your secret_access_key.
10+
(The home directory on Windows is usually C:\Users\<Username>\ and on Linux it is /home/username.)
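
To confirm that `aws configure` wrote what the steps above describe, a minimal sketch along these lines could be used. It assumes the values were saved under the `[default]` profile, which is what `aws configure` uses when no profile is named. The function name `check_aws_files` is hypothetical.

```python
import configparser
from pathlib import Path


def check_aws_files(aws_dir=None):
    """Return a list of problems found in the AWS CLI config files.

    Hypothetical helper; assumes the [default] profile is in use.
    """
    aws_dir = Path(aws_dir) if aws_dir is not None else Path.home() / ".aws"
    problems = []

    # The config file should hold the region.
    config = configparser.ConfigParser()
    config.read(aws_dir / "config")  # a missing file is silently skipped
    if not config.has_option("default", "region"):
        problems.append("config: no region in [default]")

    # The credentials file should hold both keys.
    credentials = configparser.ConfigParser()
    credentials.read(aws_dir / "credentials")
    for key in ("aws_access_key_id", "aws_secret_access_key"):
        if not credentials.has_option("default", key):
            problems.append(f"credentials: no {key} in [default]")

    return problems


if __name__ == "__main__":
    for problem in check_aws_files() or ["no problems found"]:
        print(problem)
```

The sketch only checks that the entries exist. It does not verify that the keys are valid or that the bucket is reachable.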
11+
12+
### Install and configure rclone
13+
14+
15+
**Windows**
16+
17+
The following will mount /aqua-ml-data on your S drive and allow you to explore, read and write.
18+
* Install WinFsp: http://www.secfs.net/winfsp/rel/ (Click the button to "Download WinFsp Installer" not the "SSHFS-Win (x64)" installer)
19+
* Download rclone from: https://rclone.org/downloads/
20+
* Unzip to your desktop (or some convenient location).
21+
* Add the folder that contains rclone.exe to your PATH environment variable.
22+
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~\AppData\Roaming\rclone` (creating folders if necessary)
23+
* Add your credentials in the appropriate fields in `~\AppData\Roaming\rclone\rclone.conf`
24+
* Take the `scripts/rclone/mount_to_s.bat` file from this SILNLP repo and copy it to the folder that contains the unzipped rclone.
25+
* Double-click the bat file. A command window should open and remain open. You should see something like:
26+
```
27+
C:\Users\David\Software\rclone>call rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data S:
28+
The service rclone has been started.
29+
```
30+
31+
**Linux**
32+
33+
The following will mount /aqua-ml-data to an S folder in your home directory and allow you to explore, read and write.
34+
* Download rclone from: https://rclone.org/install/
35+
* Take the `scripts/rclone/rclone.conf` file from this SILNLP repo and copy it to `~/.config/rclone/rclone.conf` (creating folders if necessary)
36+
* Add your credentials in the appropriate fields in `~/.config/rclone/rclone.conf`
37+
* Create a folder called "S" in your user directory
38+
* Run the following command:
39+
```
40+
rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data ~/S
41+
```
42+
### To start S: drive on start up
43+
44+
**Windows**
45+
46+
Put a shortcut to the mount_to_s.bat file in the Startup folder.
47+
* In Windows Explorer put `shell:startup` in the address bar or open `C:\Users\<Username>\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup`
48+
* Right-click to add a new shortcut. Choose `mount_to_s.bat` as the target; you can leave the name as the default.
49+
50+
Now your AWS S3 bucket should be mounted as S: drive when you start Windows.
51+
52+
**Linux**
53+
* Run `crontab -e`
54+
* Paste `@reboot rclone mount --vfs-cache-mode full --use-server-modtime s3aqua:aqua-ml-data ~/S` into the file, save and exit
55+
* Reboot Linux
56+
57+
Now your AWS S3 bucket should be mounted as ~/S when you start Linux.
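
After a reboot, one way to check from Python that the mount is active is the sketch below. It relies on `os.path.ismount` from the standard library; the function name `is_mounted` is hypothetical, and `~/S` is the Linux mount folder used above.

```python
import os


def is_mounted(path):
    """Return True if `path` exists and is a mount point.

    Hypothetical helper; `~` is expanded to the home directory first.
    """
    return os.path.ismount(os.path.expanduser(str(path)))


if __name__ == "__main__":
    target = "~/S"  # mount folder from the Linux instructions above
    state = "mounted" if is_mounted(target) else "not mounted"
    print(f"{target} is {state}")
```

An empty, unmounted `~/S` folder reports "not mounted", which helps distinguish a failed rclone start from an empty bucket.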
