Simplify the entire framework.
- Remove hyperparameter search feature.
- Update libraries.
- Reduce the dependency on mongodb.
shono committed Jun 29, 2021
1 parent 0447807 commit b82cc69
Showing 53 changed files with 953 additions and 1,487 deletions.
141 changes: 59 additions & 82 deletions README.md
@@ -3,20 +3,20 @@
A simple training platform for PyTorch.

The models are implemented in [mitorch-models](https://github.com/shonohs/mitorch_models).

There are two ways to use this platform: as a simple training command on a local machine, or as a service.

# Install
Use Python 3.8+.
```bash
pip install mitorch
```

# Usage

## Training Command
```bash
mitrain <config_filepath> <train_filepath> <val_filepath> [-w <pth_filepath>] [-o <output_filepath>] [-d]
```
- config_filepath
- JSON-serialized training config.
- train_filepath / val_filepath
- Filepath to the training / validation dataset
- weights_filepath
@@ -28,99 +28,76 @@

The validation results will be printed on stdout.

## Validation Command
```bash
mitest <config_filepath> <train_filepath> <val_filepath> -w <weights_filepath> [-d]
```

## Training config
The definition of training config is in mitorch/common/training_config.py.

Here is an example configuration. (The `#` comments are for illustration only and must be removed from an actual JSON file.)
```
{
"model": {
"input_size": 224,
"name": "MobileNetV3"
},
"optimizer": {
"name": "sgd",
"momentum": 0.9,
"weight_decay": 0.0001
},
"lr_scheduler": {
"name": "cosine_annealing",
"base_lr": 0.01
},
"augmentation": { # The names are defined in mitorch/datasets/factory.py.
"train": "random_resize",
"val": "center_crop"
},
"dataset": { # This setting is used by mitorch-agent.
"train": "mnist/train_images.txt",
"val": "mnist/test_images.txt"
},
"batch_size": 2,
"max_epochs": 5,
"task_type": "multiclass_classification",
"num_processes": 1 # For mitorch-agent. Specify the number of GPU/CPU for the training.
}
```
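
For example, a config like the one above can be generated and run from Python. A minimal sketch (the dataset filepaths and output filename below are placeholders):

```python
import json
import subprocess

# Mirror the sample config above; the full schema is defined in
# mitorch/common/training_config.py.
config = {
    "model": {"input_size": 224, "name": "MobileNetV3"},
    "optimizer": {"name": "sgd", "momentum": 0.9, "weight_decay": 0.0001},
    "lr_scheduler": {"name": "cosine_annealing", "base_lr": 0.01},
    "augmentation": {"train": "random_resize", "val": "center_crop"},
    "batch_size": 2,
    "max_epochs": 5,
    "task_type": "multiclass_classification",
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# Invoke the training command documented above. The dataset filepaths
# and the output filename are placeholders.
subprocess.run(["mitrain", "config.json", "train_images.txt", "val_images.txt",
                "-o", "outputs.pth"], check=True)
```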

## Dataset format
See [simpledataset](https://github.com/shonohs/simpledataset).

# Advanced usage: experiment management
You can manage experiments on remote machines using this framework.

## Setup
First, set up an Azure Blob Storage account and a MongoDB account, and collect the following information.

- Blob Storage URL with SAS token
- MongoDB URL with the access token

Set this information in the following environment variables.
```bash
export MITORCH_STORAGE_URL=<storage sas url>
export MITORCH_DATABASE_URL=<MongoDB endpoint>
```
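
The commands below expect these variables in the environment, so a script can fail fast if they are missing. A small sketch, using the variable names documented above:

```python
import os

# Check the service credentials before calling misubmit/miagent.
for name in ("MITORCH_STORAGE_URL", "MITORCH_DATABASE_URL"):
    if not os.environ.get(name):
        raise RuntimeError(f"{name} is not set; see the Setup section.")
```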

## Queue a job
```bash
misubmit <config_filepath>
```
This command will send the config file to MongoDB.
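
Conceptually, queuing a job is just inserting a document into the database. A rough pymongo sketch (the collection and field names here are assumptions for illustration, not necessarily the schema misubmit actually uses):

```python
import datetime
import json
import os

from pymongo import MongoClient

client = MongoClient(os.environ["MITORCH_DATABASE_URL"])
db = client["mitorch"]  # assumed database name

with open("config.json") as f:
    config = json.load(f)

# Insert a queued job; "status"/"created_at"/"config" are illustrative fields.
job_id = db["trainings"].insert_one({
    "status": "new",
    "created_at": datetime.datetime.utcnow(),
    "config": config,
}).inserted_id
print(f"queued job {job_id}")
```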

## Run an agent
On a powerful machine you want to use, run the following command.
```bash
miagent --data <dataset_directory>
```
It will fetch a job from MongoDB, run the training, and save the results to MongoDB and Blob Storage.
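
The agent is essentially a poll-claim-run loop against that queue. A hypothetical sketch (the collection name, status values, and `run_training` callback are assumptions; the real miagent may differ):

```python
import time

def agent_loop(db, run_training, poll_interval=60):
    """Poll for queued jobs, claim one atomically, run it, record the result."""
    while True:
        # find_one_and_update claims the job so two agents never pick the same one.
        job = db["trainings"].find_one_and_update(
            {"status": "new"}, {"$set": {"status": "running"}})
        if job is None:
            time.sleep(poll_interval)  # nothing queued yet; poll again later
            continue
        try:
            results = run_training(job["config"])
            update = {"status": "completed", "results": results}
        except Exception:
            update = {"status": "failed"}
        db["trainings"].update_one({"_id": job["_id"]}, {"$set": update})
```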

## Commands
```bash
# Queue a new training
misubmit <config_filepath> [--priority <priority>]

# Get status of a training. If a job_id is not provided, it shows a list of jobs.
miquery [--job_id JOB_ID]
```

3 changes: 0 additions & 3 deletions mitorch/azureml/__init__.py

This file was deleted.

106 changes: 0 additions & 106 deletions mitorch/azureml/manager.py

This file was deleted.

