Simplify the entire framework.
- Remove hyperparameter search feature.
- Update libraries.
- Reduce the dependency on mongodb.
shono committed Jun 29, 2021
1 parent 0447807 commit b82cc69
Showing 53 changed files with 953 additions and 1,487 deletions.
141 changes: 59 additions & 82 deletions README.md
@@ -3,20 +3,20 @@
A simple training platform for PyTorch.

The models are implemented in [mitorch-models](https://github.com/shonohs/mitorch_models).

There are two ways to use this platform: as a simple training command on a local machine, or as a service.

# Install
Use Python 3.8+.
```bash
pip install mitorch
```

# Usage

## Training Command
```bash
mitrain <config_filepath> <train_filepath> <val_filepath> [-w <pth_filepath>] [-o <output_filepath>] [-d]
```
- config_filepath
- JSON-serialized training config.
- train_filepath / val_filepath
- Filepath to the training / validation dataset
- weights_filepath
@@ -28,99 +28,76 @@

The validation results will be printed on stdout.

## Validation Command
```bash
mitest <config_filepath> <train_filepath> <val_filepath> -w <weights_filepath> [-d]
```

## Training config
The definition of training config is in mitorch/common/training_config.py.

Here is an example configuration. (The `#` comments are for illustration only and must be removed from an actual JSON file.)
```
{
"model": {
"input_size": 224,
"name": "MobileNetV3"
},
"optimizer": {
"name": "sgd",
"momentum": 0.9,
"weight_decay": 0.0001
},
"lr_scheduler": {
"name": "cosine_annealing",
"base_lr": 0.01
},
"augmentation": { # The names are defined in mitorch/datasets/factory.py.
"train": "random_resize",
"val": "center_crop"
},
"dataset": { # This setting is used by mitorch-agent.
"train": "mnist/train_images.txt",
"val": "mnist/test_images.txt"
},
"batch_size": 2,
"max_epochs": 5,
"task_type": "multiclass_classification",
"num_processes": 1 # For mitorch-agent. Specify the number of GPU/CPU for the training.
}
```
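
For example, a config like the one above can be generated and run from Python. A minimal sketch (the dataset filepaths and output filename below are placeholders):

```python
import json
import subprocess

# Mirror the sample config above; the full schema is defined in
# mitorch/common/training_config.py.
config = {
    "model": {"input_size": 224, "name": "MobileNetV3"},
    "optimizer": {"name": "sgd", "momentum": 0.9, "weight_decay": 0.0001},
    "lr_scheduler": {"name": "cosine_annealing", "base_lr": 0.01},
    "augmentation": {"train": "random_resize", "val": "center_crop"},
    "batch_size": 2,
    "max_epochs": 5,
    "task_type": "multiclass_classification",
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# Invoke the training command documented above. The dataset filepaths
# and the output filename are placeholders.
subprocess.run(["mitrain", "config.json", "train_images.txt", "val_images.txt",
                "-o", "outputs.pth"], check=True)
```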

## Dataset format
See [simpledataset](https://github.com/shonohs/simpledataset).

# Advanced usage: experiment management
You can manage experiments on remote machines using this framework.

## Setup
First, set up an Azure Blob Storage account and a MongoDB account, and collect the following information.

- Blob Storage URL with SAS token
- MongoDB URL with the access token

Set this information in the following environment variables.
```bash
export MITORCH_STORAGE_URL=<storage sas url>
export MITORCH_DATABASE_URL=<MongoDB endpoint>
```
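
The commands below expect these variables in the environment, so a script can fail fast if they are missing. A small sketch, using the variable names documented above:

```python
import os

# Check the service credentials before calling misubmit/miagent.
for name in ("MITORCH_STORAGE_URL", "MITORCH_DATABASE_URL"):
    if not os.environ.get(name):
        raise RuntimeError(f"{name} is not set; see the Setup section.")
```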

## Queue a job
```bash
misubmit <config_filepath>
```
This command will send the config file to MongoDB.
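
Conceptually, queuing a job is just inserting a document into the database. A rough pymongo sketch (the collection and field names here are assumptions for illustration, not necessarily the schema misubmit actually uses):

```python
import datetime
import json
import os

from pymongo import MongoClient

client = MongoClient(os.environ["MITORCH_DATABASE_URL"])
db = client["mitorch"]  # assumed database name

with open("config.json") as f:
    config = json.load(f)

# Insert a queued job; "status"/"created_at"/"config" are illustrative fields.
job_id = db["trainings"].insert_one({
    "status": "new",
    "created_at": datetime.datetime.utcnow(),
    "config": config,
}).inserted_id
print(f"queued job {job_id}")
```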

## Run an agent
On a powerful machine you want to use, run the following command.
```bash
miagent --data <dataset_directory>
```
It will fetch a job from MongoDB, run the training, and save the results to MongoDB and Blob Storage.
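
The agent is essentially a poll-claim-run loop against that queue. A hypothetical sketch (the collection name, status values, and `run_training` callback are assumptions; the real miagent may differ):

```python
import time

def agent_loop(db, run_training, poll_interval=60):
    """Poll for queued jobs, claim one atomically, run it, record the result."""
    while True:
        # find_one_and_update claims the job so two agents never pick the same one.
        job = db["trainings"].find_one_and_update(
            {"status": "new"}, {"$set": {"status": "running"}})
        if job is None:
            time.sleep(poll_interval)  # nothing queued yet; poll again later
            continue
        try:
            results = run_training(job["config"])
            update = {"status": "completed", "results": results}
        except Exception:
            update = {"status": "failed"}
        db["trainings"].update_one({"_id": job["_id"]}, {"$set": update})
```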

## Commands
```bash
# Queue a new training
misubmit <config_filepath> [--priority <priority>]

# Get status of a training. If a job_id is not provided, it shows a list of jobs.
miquery [--job_id JOB_ID]
```

3 changes: 0 additions & 3 deletions mitorch/azureml/__init__.py

This file was deleted.

106 changes: 0 additions & 106 deletions mitorch/azureml/manager.py

This file was deleted.

