Repository for B.Sc. Physics and Astronomy project: Improving the Performance of Conservative-to-Primitive Inversion in Relativistic Hydrodynamics Using Artificial Neural Networks. The thesis and presentation can be found in thesis.pdf and presentation.pdf in the root directory.
models contains trained models with their net.pth, optimizer.pth, scheduler.pth, net.pt, and all other data saved in csv and json files. Each model directory also contains its own local copy of the scripts, in which the hyperparameters, subparameters and output file names are set to correspond to the model in question. These local scripts preserve the state in which each model was generated for the thesis. They are outdated versions of the scripts SRHD_ML.ipynb (or SRHD_ML.py) and GRMHD_ML.ipynb (or GRMHD_ML.py) found in the src directory: they contain bugs that were fixed later (see addendum/) and outdated comments.
src is the directory in which one can experiment with creating new models. It has the most up-to-date versions of the scripts for SRHD and GRMHD. The SRHD script is itself an older version of the GRMHD script; it can still be used independently of the GRMHD script, but it has more bugs. A listing of commit messages between the two, taken from the original older repository of the project, can be found in addendum/commit_messages_SRHD_to_GRMHD.txt. To continue the project, we advise editing only the script GRMHD_ML.ipynb (or GRMHD_ML.py), keeping track of significant states of the script (e.g. optimizing a given model with given parameters) in some other way, and implementing code to load different states quickly.
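One possible way to keep track of significant states is sketched below, under our own assumptions: the settings of each state are stored as a named JSON file and loaded again by name. The directory name states, the function names save_state and load_state, and the example keys are hypothetical and do not exist in the repository.

```python
import json
from pathlib import Path

# Hypothetical directory in which named states are stored (not part of the repository).
STATES_DIR = Path("states")


def save_state(name, settings, states_dir=STATES_DIR):
    """Write a dictionary of settings to <states_dir>/<name>.json and return the path."""
    states_dir = Path(states_dir)
    states_dir.mkdir(parents=True, exist_ok=True)
    path = states_dir / f"{name}.json"
    with path.open("w") as f:
        json.dump(settings, f, indent=2, sort_keys=True)
    return path


def load_state(name, states_dir=STATES_DIR):
    """Read back the settings that were saved under the given name."""
    with (Path(states_dir) / f"{name}.json").open() as f:
        return json.load(f)
```

For example, save_state("NNGR1_optimize", {"OPTIMIZE": True, "n_layers": 3}) stores an illustrative state, and load_state("NNGR1_optimize") returns the same dictionary in a later session.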
C++ source code files are located in the cpp directories.
- Clone the repository to the desired location.
- Create a virtual environment in conda or python venv if desired.
- Run
    pip install -r requirements.txt
  Make sure torch is uncommented in the file.
- Follow How to use this notebook at the top of the script in question.
If a GPU is available, one can follow the steps as listed under MMAAMS workstation, Installation for the C++ scripts, but choose a CUDA-enabled distribution of libtorch instead. The rest of the procedure is the same.
- Open the GitHub repository in Google Colab.
- Create a copy of the Jupyter notebook file that one wants to run, so that changes can be saved.
- Continue with Using the scripts on Google Colab.
- Clone the repository to the desired location.
At the time of writing (Fri Jun 23 11:17:37 AM CEST 2023), Anaconda was required to get a more recent python version running on the workstation. The version that was available on the MMAAMS workstation downgraded PyTorch to a version incompatible with the sm_86 architecture of the Nvidia RTX A6000 GPU of the workstation. Anaconda installs a sandboxed newer version of python, so that PyTorch is not downgraded in the environment and the sm_86 architecture is supported. We have confirmed that the scripts work well with the GPU on python version 3.11.3.
- Set up a conda virtual environment
    conda create -n <env_name> python
  To set up with a specific version of python, run the following instead, with 3.x.x replaced by the desired version:
    conda create -n <env_name> python=3.x.x
- Activate the environment
    conda activate <env_name>
  Note that the environment must be activated every time before running the scripts.
- Install the required packages (source of the command)
    conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
- Comment out torch in requirements.txt to prevent the pip installation from overriding the PyTorch installation via conda.
- Run
    pip install -r requirements.txt
- Run the script for the model in question by following How to use this notebook at the top of the python script.
Running a model in C++ requires libtorch. At the time of writing (Fri Jun 23 11:17:52 AM CEST 2023), we could not get libtorch to work with the sm_86 architecture of the Nvidia RTX A6000 GPU on the workstation, and so we ran it on the CPU only. These are the installation instructions for the CPU-only procedure.
- Download libtorch into the desired directory (see PyTorch documentation for the latest version)
    wget https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-2.0.1%2Bcpu.zip
- Unzip the downloaded zip file.
The next step requires the CMakeLists.txt file to be set up, which we have already done for all scripts. However, if problems are encountered, consult the PyTorch documentation on using torch in C++.
- As found in the How to use this notebook section in the scripts, building can be done with
    mkdir build && cd build
    cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch/ ..
    cmake --build . --config release
  The executable can then be run with ./<executable name>.
Using either the SRHD or the GRMHD script on Google Colab is straightforward: open the Jupyter notebook file in Colab and
- Set drive_folder to save files to your desired Google Drive directory.
- Comment out (do not uncomment) the %%script echo skipping line of the drive mounting cell.
- Comment out (do not uncomment) the %%script echo skipping line of the pip install cell.
- The rest is the same as running locally, i.e. as in Using the scripts on a local machine.
- If there is no access to a Jupyter environment, use the .py version of the script instead.
- Follow How to use this notebook at the top of the script.
- Follow How to use this notebook at the top of the script.
In the Jupyter notebooks, one can load a model without retraining and without optimizing by following Loading an already trained model and Generating the C++ model in How to use this notebook at the top of the script. If Jupyter is not available, one can still follow these instructions with the .py version of the script, by explicitly commenting out the code that, according to these instructions, should not be run.
The evaluation time of an artificial neural network model can be measured with torch.cuda.Event. This is illustrated at the end of model/NNGR1/NNGR1_evaluation.py. The relevant code is:
import torch

# generate_input_data, generate_samples and net_loaded are defined earlier in the script.
example_input = generate_input_data(*generate_samples(1))
# Ensure that the model and the input data are on the same device (the GPU in this case)
model = net_loaded.to('cuda')
input_data = example_input.to('cuda')

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
output = model(input_data)
end_event.record()
torch.cuda.synchronize()  # Wait until both events have been recorded

print(f"Evaluation time: {start_event.elapsed_time(end_event)} milliseconds")
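A single timed forward pass is noisy, since the first call typically includes one-off start-up costs. A minimal sketch of a more robust measurement, assuming one is willing to average over several runs after a few warm-up calls, is given below. The helper time_callable and its parameters are hypothetical and not part of the repository; it uses only the standard library, so that it also works on the CPU.

```python
import time


def time_callable(fn, n_warmup=3, n_repeats=10, sync=None):
    """Return the mean wall-clock time of fn() in milliseconds.

    sync is an optional function that blocks until pending work is done;
    it is called before the clock is started and before it is stopped.
    """
    for _ in range(n_warmup):
        fn()  # warm-up calls are not timed
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_repeats
```

With the names of the excerpt above, one could call time_callable(lambda: model(input_data), sync=torch.cuda.synchronize) on the GPU, and omit sync on the CPU.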
Some models have scripts ending in _train.py and _optimization.py. These are copies of the same script in which OPTIMIZE is set to False and to True respectively. Together with the other settings and hyperparameters set in the file, they make it easy to test the training or the optimization process of an arbitrary model on the workstation, without repeatedly opening and editing the same file to switch between no optimization and optimization.
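An alternative to keeping two copies of a script, sketched here as an assumption rather than as how the repository works, is to set OPTIMIZE from the command line. The flag name --optimize and the function parse_args are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description="Train or optimize a model.")
    parser.add_argument(
        "--optimize",
        action="store_true",
        help="run the optimization instead of training only",
    )
    return parser.parse_args(argv)
```

A script could then set OPTIMIZE = parse_args().optimize near the top, so that running it with --optimize switches the optimization on and running it without the flag switches it off.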
Problems can arise when models that were trained on the GPU are run on the CPU and vice versa. These problems are solved by mapping the model to the CPU or to the GPU when it is loaded, which can be done without retraining the model. This mapping is most easily done in the python script from which the model was generated, e.g. GRMHD_ML.ipynb or GRMHD_ML.py. To map to the CPU when CUDA is not available, add the following code directly after the initialization of the net_loaded object in the Loading section of the script in question:
# ...
if torch.cuda.is_available():
    net_loaded.load_state_dict(torch.load("net.pth"))
else:
    # Map the loaded network to the CPU.
    net_loaded.load_state_dict(torch.load("net.pth", map_location=torch.device('cpu')))
# Load the optimizer from the .pth file
if torch.cuda.is_available():
    optimizer_loaded_state_dict = torch.load("optimizer.pth")
else:
    optimizer_loaded_state_dict = torch.load("optimizer.pth", map_location=torch.device('cpu'))
# Load the scheduler from the .pth file
if torch.cuda.is_available():
    scheduler_loaded_state_dict = torch.load("scheduler.pth")
else:
    scheduler_loaded_state_dict = torch.load("scheduler.pth", map_location=torch.device('cpu'))
# ...
It could be that one needs to map these state dictionaries to the CPU even if CUDA is in fact available. In that case we can simply replace the above if-else statements with the statements in the else-cases only:
# ...
net_loaded.load_state_dict(torch.load("net.pth", map_location=torch.device('cpu')))
optimizer_loaded_state_dict = torch.load("optimizer.pth", map_location=torch.device('cpu'))
scheduler_loaded_state_dict = torch.load("scheduler.pth", map_location=torch.device('cpu'))
# ...
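The repeated if-else statements can also be collapsed by choosing the target device once and passing it to every torch.load call. A minimal sketch of that choice follows; the helper choose_map_location and its force_cpu parameter are hypothetical names, and the function is kept free of torch imports so that it only encodes the decision.

```python
def choose_map_location(cuda_available, force_cpu=False):
    """Return the device string to pass as map_location to torch.load."""
    if force_cpu or not cuda_available:
        return "cpu"
    return "cuda"
```

In the Loading section one would then compute map_location = choose_map_location(torch.cuda.is_available()) once and call torch.load("net.pth", map_location=map_location), and likewise for optimizer.pth and scheduler.pth. Passing force_cpu=True corresponds to the case in which CUDA is available but the CPU is wanted.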
On systems such as the MMAAMS workstation, where CUDA is available but one wants to map explicitly to the CPU, the device variable must be set to the CPU as well. This should be done before the mapping of the state dictionaries discussed above. Setting device makes sure that net_loaded uses the correct device, and that the input tensor that is used to trace the model and then to generate the net.pt file is mapped to the correct device as well. To map device to the CPU, add the line
device = torch.device("cpu")
in the Loading section, before the code
net_loaded = Net(
    n_layers_loaded,
    n_units_loaded,
    hidden_activation_loaded,
    output_activation_loaded,
    dropout_rate_loaded
).to(device)
Do not forget to save and run the python script after the changes discussed have been implemented, so that the net.pt file is updated accordingly.
Make sure that the get_ipython lines in the .py files are commented out when running these files on a system without Jupyter installed.
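Instead of commenting these lines out by hand, one could guard them. The sketch below relies on the fact that the name get_ipython is only defined inside IPython and Jupyter sessions; the helper name running_in_ipython is hypothetical and not part of the scripts.

```python
def running_in_ipython():
    """Return True inside an IPython/Jupyter session, False in plain python."""
    try:
        get_ipython  # noqa: F821 (this name only exists under IPython)
    except NameError:
        return False
    return True
```

Any get_ipython() call in a .py file can then be placed under if running_in_ipython(): so that it is skipped when the file is run with a plain python interpreter.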
It is advised to use the python3 command instead of the python command for running the scripts, e.g. on the workstation, as python can still be linked to version 2 of the language.
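To check which major version a given command starts, a small diagnostic such as the following can be used. It is a sketch only; the function name interpreter_major_version is hypothetical, and the one-line program passed with -c is written so that it also runs under version 2.

```python
import shutil
import subprocess


def interpreter_major_version(command):
    """Return the major version of the interpreter that `command` starts, or None."""
    path = shutil.which(command)
    if path is None:
        return None  # the command is not on PATH
    result = subprocess.run(
        [path, "-c", "import sys; print(sys.version_info[0])"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # the interpreter could not be run
    return int(result.stdout.strip())
```

If interpreter_major_version("python") returns 2 or None while interpreter_major_version("python3") returns 3, the scripts should be started with python3.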
- To resolve error loading the model, ensure that the net.pt file (not the net.pth file) is located at the path specified by the path_to_model variable in the C++ script. path_to_model should include the file name itself. Note that if one specifies a relative path in path_to_model, this path should point to the location of net.pt relative to the executable (more precisely, relative to the directory from which the executable is run), not relative to the source file.
- If the error persists, it is likely due to trying to load a GPU-trained model on the CPU or vice versa (see Running models, trained on the GPU, on the CPU, and vice versa). For instance, if
std::cerr << e.what() << '\n';
outputs
[username@pc build]$ cmake --build . --config release && ./GRMHD
[ 50%] Building CXX object CMakeFiles/GRMHD.dir/GRMHD.cpp.o
[100%] Linking CXX executable GRMHD
[100%] Built target GRMHD
error loading the model, did you correctly set the path to the net.pt file?
error: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend.
# ...
frame #26: torch::jit::load(std::string const&, c10::optional<c10::Device>, bool) + 0xac (0x7fc32e1d7c7c in /path/to/libtorch/lib/libtorch_cpu.so)
frame #27: main + 0xb6 (0x5573f837482a in ./GRMHD)
frame #28: <unknown function> + 0x23850 (0x7fc328c39850 in /usr/lib/libc.so.6)
frame #29: __libc_start_main + 0x8a (0x7fc328c3990a in /usr/lib/libc.so.6)
frame #30: _start + 0x25 (0x5573f8374435 in ./GRMHD)
then this could be caused by trying to run a GPU-trained model on the CPU. To resolve the issue, make the modifications specified in Running models, trained on the GPU, on the CPU, and vice versa.
The file pointed to by the constant STUDY_NAME is a pickle file that saves all trials of an Optuna study. Some parts of the code do not run if the file specified by STUDY_NAME is not found. If there was no previous Optuna study, or if it is unnecessary to load such a study again, STUDY_NAME can simply be set to None in order to run the code.
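The behaviour described above can also be guarded explicitly. The following is a sketch, not the code of the scripts: the function load_study is a hypothetical name, and we assume only what is stated above, namely that STUDY_NAME is either None or the path of a pickle file.

```python
import pickle
from pathlib import Path


def load_study(study_name):
    """Return the unpickled study, or None if there is nothing to load."""
    if study_name is None:
        return None  # no previous study requested
    path = Path(study_name)
    if not path.is_file():
        return None  # file missing: continue without a previous study
    with path.open("rb") as f:
        return pickle.load(f)
```

Code that needs the study can then test whether load_study(STUDY_NAME) returned None before using it. As with any pickle file, only files from a trusted source should be loaded this way.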
The function save_file exists to save files to a specified Google Drive location; this is useful e.g. on Colab, where the runtime that contains the saved files is automatically deleted after a period of inactivity. The definition of save_file must be loaded in the script even when Colab or Google Drive is not used to save files. If Colab or Google Drive is not used, then save_file does nothing.
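The described behaviour could look roughly as follows. This is a sketch under assumptions: the actual signature of save_file in the scripts is not reproduced here, and the drive_folder parameter and the use of shutil.copy are our own choices for illustration.

```python
import shutil
from pathlib import Path


def save_file(file_name, drive_folder=None):
    """Copy file_name into drive_folder; do nothing if Drive is not in use.

    Returns the destination path, or None when nothing was copied.
    """
    if drive_folder is None or not Path(drive_folder).is_dir():
        return None  # Drive not mounted or not configured
    destination = Path(drive_folder) / Path(file_name).name
    shutil.copy(file_name, destination)
    return destination
```

Because the function returns without side effects when no Drive folder is available, calls to it can stay in the script on a local machine or on the workstation.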
This issue arises when class Net is not defined. The class definition must be known e.g. when loading a pre-trained model. The issue is resolved by making sure that the definition of class Net has been run before the model is loaded.
This issue can arise e.g. when trying to load a model without training or optimizing. Note that the net object is an instance of class Net that is only used in optimizing and training; when we load a model without either training or optimizing, we use the net_loaded object instead. Likewise, all the variables associated with net, such as train_metrics, test_metrics, var_dict, etc., are suffixed with _loaded when we load a model without optimizing or training it beforehand: train_metrics_loaded, test_metrics_loaded, var_dict_loaded, etc. This was done so that loading a model does not override the original variables, and so that correct loading can be verified.