This is an accompanying repository for a paper called "Simdex-ML: an online ML reference example", which was submitted to the ECSA 2023 Tools & Demos track.
It builds on the "SIMDEX: ReCodEx Backend Simulator and Dataset" artifact (source code, paper).
This repository contains:
- A simulator of a job-processing backend of a real system enhanced with machine learning and reinforcement learning components.
- A dataset comprising a log of workload metadata from real users, collected from our instance of ReCodEx (a system for evaluating coding assignments). The simulator can replay the logs, which allows evaluation based on real data.
The repository is ready to be used immediately as is (just clone it). You only need to have Python 3.7+ installed. If you are using a Python virtual environment, do not forget to adjust the paths to the python3 and pip3 executables.
Install basic dependencies:
$> cd ./simulation
$> pip3 install -r ./requirements.txt
Quickly check that the scripts run (on a dataset sample):
$> python3 ./main.py --config ./experiments/user_experience_rl_nn_fast.yaml --refs ../data/release01-2021-12-29/ref-solutions.csv ../data/release01-2021-12-29/data-sample.csv
The simulator entry point is the main.py script, which is invoked as:
$> python3 ./main.py --config <path-to-config-file> [options] <path-to-data-file>
The config is in a .yaml file that is used to initialize the simulation. Config files for our examples are already in this repository and additional information can be found in the quick guide.
The data file is a .csv or .csv.gz file that must be in the same format as our dataset.
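To illustrate how a plain or gzip-compressed data file can be read in the same way, here is a minimal Python sketch. It is not the simulator's actual loader: the function name `read_rows`, the comma delimiter, the header row, and the UTF-8 encoding are all assumptions made for illustration.

```python
import csv
import gzip
import itertools


def read_rows(path, limit=None, delimiter=","):
    """Yield rows as dicts from a .csv or .csv.gz file (hypothetical helper).

    Assumptions: the file has a header row, is UTF-8 encoded, and uses the
    given delimiter. `limit` caps the number of rows returned (None = all).
    """
    # Choose the opener by extension; gzip.open in text mode decompresses on the fly.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as fp:
        reader = csv.DictReader(fp, delimiter=delimiter)
        # islice stops early, so a large file is not read past the limit.
        yield from itertools.islice(reader, limit)
```

For example, `list(read_rows("data-sample.csv", limit=100))` would return at most the first 100 rows of a file with that name.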
Additional options recognized by the main script:
- --refs option holds one string value -- a path to reference solutions data file (.csv or .csv.gz), please note that ref. solutions must be loaded for some experiments
- --limit option holds one integer, which is a maximal number of rows loaded from the data file (allows to restrict the number of simulated jobs)
- --progress is a bool flag that enables progress printouts to std. output (particularly useful for ML experiments that take a long time to process)
- --seed option holds one integer which sets the seed for the random number generator
- --output_folder specifies the folder where the simulation results will be saved. The output folder path can include special variables which are replaced by their values as follows: @@config will be replaced by the name of the config file, @@datetime will be replaced by the current date and time, @@seed will be replaced with the random generator seed
- --inference_batch_size option holds one integer, which defines the batch size for job duration prediction inference (default is 1). This can be used to speed up the simulation if an NN is used for job duration prediction
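The options above can be mirrored in a short Python sketch, shown here only to make their types and the output-folder variables concrete. It is not the code in main.py: `build_parser` and `expand_output_folder` are hypothetical names, the date and time format is an assumption, and so is the choice to use the config file name without its extension for @@config.

```python
import argparse
from datetime import datetime
from pathlib import Path


def build_parser():
    """Hypothetical argparse mirror of the documented command-line options."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("data_file", help="path to the .csv or .csv.gz data file")
    parser.add_argument("--config", required=True, help="path to the .yaml config")
    parser.add_argument("--refs", help="path to the reference solutions file")
    parser.add_argument("--limit", type=int, help="maximal number of rows to load")
    parser.add_argument("--progress", action="store_true", help="print progress")
    parser.add_argument("--seed", type=int, help="random number generator seed")
    parser.add_argument("--output_folder", help="folder for simulation results")
    parser.add_argument("--inference_batch_size", type=int, default=1)
    return parser


def expand_output_folder(template, config_path, seed, now=None):
    """Replace @@config, @@datetime and @@seed in an output folder path."""
    now = now or datetime.now()
    values = {
        # Assumption: "name of the config file" means the name without extension.
        "@@config": Path(config_path).stem,
        # Assumption: the real timestamp format may differ.
        "@@datetime": now.strftime("%Y-%m-%d_%H-%M-%S"),
        "@@seed": str(seed),
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template
```

Under these assumptions, the template `results/@@config-@@seed` with the config `./experiments/user_experience_rl_nn_fast.yaml` and seed 42 expands to `results/user_experience_rl_nn_fast-42`.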