Create Performing the Run process.md #714

Merged: 25 commits (May 31, 2022)

Commits:
- 09f082b Create Performing the Run process.md (LiangWenshuo1118, Apr 25, 2022)
- 65ccd26 Create Specify the Run process.md (LiangWenshuo1118, Apr 25, 2022)
- 02818ab Update and rename Specify the Run process.md to example-of-param.md (LiangWenshuo1118, Apr 27, 2022)
- 8f76b12 Update and rename Performing the Run process.md to Overview-of-the-ru… (LiangWenshuo1118, Apr 27, 2022)
- 054bf7e Update and rename Overview-of-the-run-process.md to overview-of-the-r… (LiangWenshuo1118, Apr 28, 2022)
- bd10185 Update example-of-param.md (LiangWenshuo1118, Apr 28, 2022)
- 40b1261 Update example-of-param.md (LiangWenshuo1118, Apr 28, 2022)
- e0d0276 Update example-of-param.md (LiangWenshuo1118, Apr 28, 2022)
- 5b5b6a7 Create example-of-machine (LiangWenshuo1118, Apr 29, 2022)
- 8a8391c Rename example-of-machine to example-of-machine.md (LiangWenshuo1118, Apr 29, 2022)
- e4e8b7a add param.rst (LiangWenshuo1118, May 6, 2022)
- 76d1da9 Update param.rst (LiangWenshuo1118, May 6, 2022)
- 74b641d updata dpgen run param parameters (LiangWenshuo1118, May 6, 2022)
- 5895748 Update param.rst (LiangWenshuo1118, May 6, 2022)
- 2980124 Update param.rst (LiangWenshuo1118, May 6, 2022)
- b525471 Update example-of-param.md (LiangWenshuo1118, May 9, 2022)
- 3139279 Update overview-of-the-run-process.md (LiangWenshuo1118, May 9, 2022)
- d62b87b Update example-of-machine.md (LiangWenshuo1118, May 9, 2022)
- 6a58293 Update example-of-machine.md (LiangWenshuo1118, May 9, 2022)
- 9de5f3f Update overview-of-the-run-process.md (LiangWenshuo1118, May 9, 2022)
- 870b762 Create run-process.rst (LiangWenshuo1118, May 9, 2022)
- e892dc1 Update index.rst (LiangWenshuo1118, May 9, 2022)
- 5dea165 Update param.rst (LiangWenshuo1118, May 9, 2022)
- 846c006 Update param.rst (LiangWenshuo1118, May 9, 2022)
- b6acd00 Update index.rst (LiangWenshuo1118, May 11, 2022)
41 changes: 39 additions & 2 deletions doc/index.rst
```diff
@@ -2,13 +2,50 @@
 DPGEN's documentation
 ==========================
 
-.. _parameters::
+.. _overview::
 
 .. toctree::
    :maxdepth: 2
-   :caption: Parameters
+   :caption: Overview
 
 
+.. _installation::
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Installation
+
+
+.. _run::
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Run
+
+   run/run-process.rst
+   run/param.rst
+   run-mdata.rst
+
+.. _init::
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Init
+
+
+.. _autotest::
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Autotest
+
+
+.. _simplify::
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Simplify
+
 
 .. _tutorial:
 
```
118 changes: 118 additions & 0 deletions doc/run/example-of-machine.md
# Example of machine.json

## DPDispatcher Update Note

DPDispatcher has been updated and the API of machine.json has changed. DP-GEN will use the new DPDispatcher if the value of the key "api_version" in machine.json is equal to or larger than 1.0. For now, DPDispatcher is maintained in a separate repository (https://github.com/deepmodeling/dpdispatcher). Please check the documentation (https://deepmd.readthedocs.io/projects/dpdispatcher/en/latest/) for more information about the new DPDispatcher.

DP-GEN will use the old DPDispatcher if the key "api_version" is not specified in machine.json or if "api_version" is smaller than 1.0. This guarantees that old machine.json files still work.

## New DPDispatcher

Each iteration in the run process of DP-GEN is composed of three steps: exploration, labeling, and training. Accordingly, machine.json is composed of three parts: train, model_devi, and fp. Each part is a list of dicts, and each dict can be considered as an independent computing environment.

In this section, we will show how to perform the train task on a local workstation, the model_devi task on a local Slurm cluster, and the fp task on a remote PBS cluster using the new DPDispatcher. For each task, three types of keys are needed (a skeleton of the whole file is sketched after this list):
- Command: provides the command used to execute each step.
- Machine: specifies the machine environment (local workstation, local or remote cluster, or cloud server).
- Resources: specify the number of groups, nodes, CPUs, and GPUs, and enable the virtual environment.
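
The overall layout of machine.json is sketched below. This is a minimal skeleton, not a runnable configuration: the "machine" and "resources" dicts are left empty here and are filled in for each task in the examples that follow.

```json
{
    "api_version": "1.0",
    "train": [
        {"command": "dp", "machine": {}, "resources": {}}
    ],
    "model_devi": [
        {"command": "lmp", "machine": {}, "resources": {}}
    ],
    "fp": [
        {"command": "mpirun -n 32 vasp_std", "machine": {}, "resources": {}}
    ]
}
```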

### Performing the train task on a local workstation

In this example, we perform the `train` task on a local workstation.

```json
"train": [
{
"command": "dp",
"machine": {
"batch_type": "Shell",
"context_type": "local",
"local_root": "./",
"remote_root": "/home/user1234/work_path"
},
"resources": {
"number_node": 1,
"cpu_per_node": 4,
"gpu_per_node": 1,
"group_size": 1,
"source_list": ["/home/user1234/deepmd.env"]
}
}
],
```

The "command" for the train task in the DeePMD-kit is "dp".

In the machine parameters, "batch_type" specifies the type of job scheduling system. If there is no job scheduling system, we can use "Shell" to perform the task. "context_type" specifies the method of data transfer, and "local" means copying and moving data via the local file system (e.g. cp, mv). In DP-GEN, the paths of all tasks are automatically located and set by the software, so "local_root" is always set to "./". The input files for each task will be sent to the "remote_root" and the task will be performed there, so we need to make sure that the path exists.

In the resources parameter, "number_node", "cpu_per_node", and "gpu_per_node" specify the number of nodes, CPUs, and GPUs required for a task, respectively. "group_size", which needs to be highlighted, specifies how many tasks will be packed into one group. In the training step we need to train 4 models. If we only have one GPU, we can set "group_size" to 4 so that all four training tasks are packed into a single group and executed one after another. If "group_size" were set to 1 instead, the 4 models would be trained on the one GPU at the same time, because with "Shell" there is no job scheduling system to queue them. Finally, environment variables can be activated by "source_list". In this example, "source /home/user1234/deepmd.env" is executed before "dp" to load the environment variables necessary for the training task.

### Performing the model_devi task on a local Slurm cluster

In this example, we perform the model_devi task on a local Slurm cluster.

```json
"model_devi": [
{
"command": "lmp",
"machine": {
"context_type": "local",
"batch_type": "Slurm",
"local_root": "./",
"remote_root": "/home/user1234/work_path"
},
"resources": {
"number_node": 1,
"cpu_per_node": 4,
"gpu_per_node": 1,
"queue_name": "QueueGPU",
"custom_flags" : ["#SBATCH --mem=32G"],
"group_size": 10,
"source_list": ["/home/user1234/lammps.env"]
}
}
],
```

The "command" for the model_devi task in the LAMMPS is "lmp".

In the machine parameter, we specify the type of job scheduling system by changing the "batch_type" to "Slurm".

In the resources parameter, we specify the name of the queue to which the task is submitted by adding "queue_name". We can add additional lines to the submission script via "custom_flags". In the model_devi step there are frequently many short tasks, so we usually pack multiple tasks (e.g. 10) into one group for submission. Other parameters are similar to those of the local workstation; a sketch of the resulting script header is given below.
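
For illustration only: with the resources above, the submission script generated by DPDispatcher could be expected to start with Slurm directives roughly like the following (the exact directives are produced by DPDispatcher and may differ between versions), with the "custom_flags" line appended verbatim:

```sh
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 4
#SBATCH --gres=gpu:1
#SBATCH --partition QueueGPU
#SBATCH --mem=32G
```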

### Performing the fp task on a remote PBS cluster

In this example, we perform the fp task on a remote PBS cluster that can be accessed via SSH.

```json
"fp": [
{
"command": "mpirun -n 32 vasp_std",
"machine": {
"context_type": "SSHContext",
"batch_type": "PBS",
"local_root": "./",
"remote_root": "/home/user1234/work_path",
"remote_profile": {
"hostname": "39.xxx.xx.xx",
"username": "user1234"
}
},
"resources": {
"number_node": 1,
"cpu_per_node": 32,
"gpu_per_node": 0,
"queue_name": "QueueCPU",
"group_size": 5,
"source_list": ["/home/user1234/vasp.env"]
}
}
],
```

The VASP code is used for the fp task, and MPI is used for parallel computing, so "mpirun -n 32" is prepended to the command to specify the number of MPI processes.

In the machine parameter, "context_type" is modified to "SSHContext" and "batch_type" is modified to "PBS". It is worth noting that "remote_root" should be set to an accessible path on the remote PBS cluster. "remote_profile" is added to specify the information used to connect the remote cluster, including hostname, username, port, etc.

In the resources parameter, we set "gpu_per_node" to 0 since it is cost-effective to use the CPU for VASP calculations.

Explicit descriptions of keys in machine.json will be given in the following section.
128 changes: 128 additions & 0 deletions doc/run/example-of-param.md
# Example of param.json

We have provided different examples of param.json in dpgen/examples/run/. In this section, we describe param.json, taking dpgen/examples/run/dp2.x-lammps-vasp/param_CH4_deepmd-kit-2.0.1.json as an example. This is a param.json for a gas-phase methane molecule. Here, the DeePMD-kit (v2.x), LAMMPS, and VASP codes are used for training, exploration, and labeling, respectively.

## basics

The basics-related keys in param.json are given as follows:

```json
"type_map": [
"H",
"C"
],
"mass_map": [
1,
12
],
```

The basics-related keys specify the basic information about the system. "type_map" gives the atom types, i.e. "H" and "C". "mass_map" gives the standard atomic weights, i.e. "1" and "12".

## data

The data-related keys in param.json are given as follows:

```json
"init_data_prefix": "....../init/",
"init_data_sys": [
"CH4.POSCAR.01x01x01/02.md/sys-0004-0001/deepmd"
],

"sys_configs_prefix": "....../init/",
"sys_configs": [
[
"CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale*/00000*/POSCAR"
],
[
"CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale*/00001*/POSCAR"
]
],
```

The data-related keys specify the initial data used for training the initial DP models and the structures used for model_devi calculations. "init_data_prefix" and "init_data_sys" specify the location of the initial data. "sys_configs_prefix" and "sys_configs" specify the location of the structures.

Here, the initial data is provided at "....../init/CH4.POSCAR.01x01x01/02.md/sys-0004-0001/deepmd". The structures are divided into two groups, provided at "....../init/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale*/00000*/POSCAR" and "....../init/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale*/00001*/POSCAR".

## training

The training-related keys in param.json are given as follows:

```json
"numb_models": 4,
"train_param": "input.json",
"default_training_param": {
},
```
The training-related keys specify the details of the training tasks. "numb_models" specifies the number of models to be trained. "train_param" specifies the file name of the DeePMD-kit input script generated for each training task, here input.json. "default_training_param" specifies the training parameters passed to DeePMD-kit.

Here, 4 DP models will be trained in `00.train`. A detailed explanation of training parameters can be found in DeePMD-kit’s documentation (https://docs.deepmodeling.com/projects/deepmd/en/master/).
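
In the json snippet above, "default_training_param" is shown empty for brevity; in the example file it is fully specified. As an illustrative sketch only, its contents follow the standard DeePMD-kit v2 input format, with keys such as the following (the values here are placeholders, not the ones from the example file):

```json
"default_training_param": {
    "model": {
        "descriptor": {
            "type": "se_e2_a",
            "rcut": 6.0,
            "sel": [16, 4],
            "neuron": [25, 50, 100]
        },
        "fitting_net": {
            "neuron": [240, 240, 240]
        }
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001
    },
    "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1
    },
    "training": {
        "numb_steps": 400000,
        "disp_freq": 1000,
        "save_freq": 1000
    }
}
```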

## exploration

The exploration-related keys in param.json are given as follows:

```json
"model_devi_dt": 0.002,
"model_devi_skip": 0,
"model_devi_f_trust_lo": 0.05,
"model_devi_f_trust_hi": 0.15,
"model_devi_clean_traj": true,
"model_devi_jobs": [
{
"sys_idx": [
0
],
"temps": [
100
],
"press": [
1.0
],
"trj_freq": 10,
"nsteps": 300,
"ensemble": "nvt",
"_idx": "00"
},
{
"sys_idx": [
1
],
"temps": [
100
],
"press": [
1.0
],
"trj_freq": 10,
"nsteps": 3000,
"ensemble": "nvt",
"_idx": "01"
}
],
```
The exploration-related keys specify the details of the exploration tasks. "model_devi_dt" specifies the timestep of the MD simulation. "model_devi_skip" specifies the number of structures skipped for saving in each MD run. "model_devi_f_trust_lo" and "model_devi_f_trust_hi" specify the lower and upper bounds of the model deviation of forces used for selection. "model_devi_clean_traj" specifies whether to clean the traj folders in MD: if it is a boolean, it denotes whether to clean them (they can be very large); if it is an integer, the traj folders of the most recent n iterations are kept. In "model_devi_jobs", "sys_idx" specifies the group of structures used for model_devi calculations, "temps" specifies the temperature (K) in MD, "press" specifies the pressure (bar) in MD, "trj_freq" specifies the saving frequency of the trajectory in MD, "nsteps" specifies the number of MD steps, "ensemble" specifies the ensemble used in MD, and "_idx" specifies the index of the iteration.

Here, MD simulations are performed at a temperature of 100 K and a pressure of 1.0 bar with a timestep of 2 fs under the nvt ensemble. Two iterations are set in "model_devi_jobs": MD simulations are run for 300 and 3000 time steps with the first and second groups of structures in "sys_configs" in iterations 00 and 01, respectively. Since "model_devi_skip" is 0, no saved frames are skipped, and with `"trj_freq"` set to 10, 30 and 300 structures are saved in iterations 00 and 01. If the "max_devi_f" of a saved structure falls between 0.05 and 0.15, DP-GEN will treat the structure as a candidate. We choose to clean the traj folders in MD since they are too large. If you want to keep the traj folders of the most recent n iterations, you can set "model_devi_clean_traj" to an integer.

## labeling

The labeling-related keys in param.json are given as follows:

```json
"fp_style": "vasp",
"shuffle_poscar": false,
"fp_task_max": 20,
"fp_task_min": 1,
"fp_pp_path": "....../methane/",
"fp_pp_files": [
"POTCAR"
],
"fp_incar": "....../INCAR_methane"
```

The labeling-related keys specify the details of the labeling tasks. "fp_style" specifies the first-principles software; here it is VASP. "fp_task_max" and "fp_task_min" specify the maximum and minimum numbers of structures to be calculated in `02.fp` of each iteration. "fp_pp_path" and "fp_pp_files" specify the location of the pseudopotential files to be used in 02.fp. "fp_incar" specifies the input file for VASP; the INCAR must specify KSPACING and KGAMMA, as sketched below.
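
As an illustration (these values are placeholders, not the ones from INCAR_methane), the two mandatory INCAR lines look like:

```
KSPACING = 0.5
KGAMMA = .FALSE.
```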

Here, a minimum of 1 and a maximum of 20 structures will be labeled using the VASP code with the INCAR provided at "....../INCAR_methane" and POTCAR provided at "....../methane/POTCAR" in each iteration. Note that the order of elements in POTCAR should correspond to the order in `type_map`.

All the keys of DP-GEN are explained in detail in the section Parameters.
65 changes: 65 additions & 0 deletions doc/run/overview-of-the-run-process.md
# Overview of the Run process

The run process contains a series of successive iterations that are undertaken in a planned order, for example heating the system to certain temperatures step by step. Each iteration is composed of three steps: exploration, labeling, and training. Accordingly, there are three sub-folders in each iteration: 00.train, 01.model_devi, and 02.fp.

00.train: DP-GEN will train several (default 4) models based on initial and generated data. The only difference between these models is the random seed for neural network initialization.

01.model_devi: model_devi is short for model deviation. DP-GEN will use the models obtained from 00.train to run molecular dynamics (default: LAMMPS). A larger deviation of a structure's properties (default: the forces on atoms) means lower accuracy of the models. Using this criterion, a few structures will be selected and passed to the next stage, 02.fp, for more accurate first-principles calculations.

02.fp: The selected structures will be calculated by first-principles methods (default: VASP). DP-GEN will collect the new data and put them together with the initial data and the data generated in previous iterations. After that, a new training is set up and DP-GEN enters the next iteration!

In the run process of DP-GEN, we need to specify the basic information about the system, the initial data, and the details of the training, exploration, and labeling tasks. In addition, we need to specify the software, the machine environment, and the computing resources, so that job generation, submission, query, and collection proceed automatically. We can make the run process behave as we expect by specifying the keywords in param.json and machine.json; they will be introduced in detail in the following sections.

Here, we give a general description of the run process. We can execute the run process of DP-GEN easily by:

```sh
dpgen run param.json machine.json
```

The following files or folders will be created and updated by DP-GEN (see the directory sketch after this list):

- iter.00000x contains the main results that DP-GEN generates in iteration x.
- record.dpgen records the current stage of the run process.
- dpgen.log includes time and iteration information.
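
For example, after the first iteration finishes, the working directory might look like this (param.json and machine.json are the input files you provided; the rest is generated by DP-GEN):

```sh
$ ls
dpgen.log  iter.000000  machine.json  param.json  record.dpgen
```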

When the first iteration is completed, the folder structure of iter.000000 is like this:

```sh
$ ls iter.000000
00.train 01.model_devi 02.fp
```

In folder iter.000000/00.train:

- Folder 00x contains the input and output files of DeePMD-kit, in which a model is trained.
- graph.00x.pb is the model that DeePMD-kit generates. The only difference between these models is the random seed for neural network initialization.

In folder iter.000000/01.model_devi:

- Folder confs contains the initial configurations for LAMMPS MD, converted from the POSCAR files you set in "sys_configs" of param.json.
- Folder task.000.00000x contains the input and output files of LAMMPS. In folder task.000.00000x, the file model_devi.out records the model deviation of the concerned labels, energy and force, in MD. It serves as the criterion for selecting which structures will undergo first-principles calculations (see the sketch after this list).
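
As an illustrative sketch, model_devi.out is a plain-text table; the numbers below are made up, but the columns follow the header written during the model deviation run:

```sh
$ head -n 3 task.000.000000/model_devi.out
#       step  max_devi_v  min_devi_v  avg_devi_v  max_devi_f  min_devi_f  avg_devi_f
           0    1.5e-02     5.1e-03     9.6e-03     2.1e-02     8.3e-03     1.4e-02
          10    1.8e-02     6.0e-03     1.1e-02     6.7e-02     2.2e-02     4.1e-02
```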

In folder iter.000000/02.fp:

- candidate.shuffled.000.out records which structures will be selected from the last step, 01.model_devi. There are always far more candidates than the maximum number you expect to calculate at one time, so DP-GEN will randomly choose up to `"fp_task_max"` structures and form the folders task.*.
- rest_accurate.shuffled.000.out records the structures for which our model is accurate ("max_devi_f" is less than `"model_devi_f_trust_lo"`; there is no need to calculate them again).
- rest_failed.shuffled.000.out records the structures for which our model is too inaccurate ("max_devi_f" is larger than `"model_devi_f_trust_hi"`; there may be some error in these structures).
- data.000: after the first-principles calculations, DP-GEN will collect these data and convert them into the format DeePMD-kit needs. In the next iteration's 00.train, these data will be used for training together with the initial data.

DP-GEN identifies the stage of the run process by a record file, record.dpgen, which will be created and updated by the code. Each line contains two numbers: the first is the index of the iteration, and the second, ranging from 0 to 8, records which stage of that iteration is currently running.

| Index of iterations | Stage in each iteration | Process |
|:---------------------|:----------------------------|:-----------------|
| 0 | 0 | make_train |
| 0 | 1 | run_train |
| 0 | 2 | post_train |
| 0 | 3 | make_model_devi |
| 0 | 4 | run_model_devi |
| 0 | 5 | post_model_devi |
| 0 | 6 | make_fp |
| 0 | 7 | run_fp |
| 0 | 8 | post_fp |

Stages 0, 1, and 2 correspond to make_train, run_train, and post_train: DP-GEN will write the task scripts in make_train, run the tasks on the specified machine in run_train, and collect the results in post_train. The records for the model_devi and fp stages follow similar rules, as the sketch below shows.
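
For example, a record.dpgen for a run that has just completed the whole first iteration would look like this (an illustrative sketch):

```sh
$ cat record.dpgen
0 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
```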

If the DP-GEN process stops for some reason, DP-GEN will automatically resume the main process from record.dpgen. You may also modify it manually for your own purposes, such as removing the last iterations and restarting from a checkpoint.