From 83878c55dc6f0064e31024eb0f8e442d41cc29f6 Mon Sep 17 00:00:00 2001
From: tangwei12
Date: Mon, 25 Jun 2018 20:58:19 +0800
Subject: [PATCH 1/7] checkpoint doc

---
 .../howto/training/checkpoint_doc_cn.md | 56 +++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 source/user_guides/howto/training/checkpoint_doc_cn.md

diff --git a/source/user_guides/howto/training/checkpoint_doc_cn.md b/source/user_guides/howto/training/checkpoint_doc_cn.md
new file mode 100644
index 00000000000..6ab59e09e01
--- /dev/null
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md

# Checkpoint User Guide
## Background
During single-machine or multi-machine training, software or hardware failures can interrupt the job. The run then produces no result, or an unusable one, wasting a large amount of time and machine capacity.

## Purpose
The checkpoint feature saves intermediate training state to disk while training runs. When training is interrupted by a failure, the saved state can be loaded and training resumed, which provides fault-tolerant training on both single and multiple machines.

## Notes
### Parameter saving implemented so far:
1. Trainer 0 saves the model parameters during training.
2. Each PServer saves the parameters related to the Distribute Lookup Table.
### Layout of the Fluid checkpoint directory:
checkpoint_dir (user-defined checkpoint directory)
├── checkpoint_0 (first save)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (the trainer's own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (second save)

## Usage
### Declare Fluid.CheckpointConfig
The checkpoint feature is configured through the Fluid.CheckpointConfig object.
CheckpointConfig takes 4 parameters:
```table
Parameter | Type | Description
checkpoint_dir | str | directory in which checkpoints are stored
max_num_checkpoints | int | maximum number of checkpoint copies to keep
epoch_interval | int | save a checkpoint every epoch_interval epochs
step_interval | int | save a checkpoint every step_interval steps
```
### Pass Fluid.CheckpointConfig when declaring the Fluid.Trainer object
The parameters of Trainer's `__init__` method include a CheckpointConfig; pass in a CheckpointConfig object that was declared before the Trainer.
For example:
```python
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)
trainer = Trainer(..., checkpoint_config=config)
```
Once this is defined and declared, training saves checkpoints at the specified steps and epochs, and after a failure the parameters are restored automatically from the latest checkpoint directory.

## Related API
https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
## Caveats
1. Make sure each training job's ```checkpoint_dir``` is independent of every other job's.
2. Tune the maximum number of copies, max_num_checkpoints, to the disk capacity and the model size, so the disk stays usable.
3. Do not set epoch_interval and step_interval too small; checkpointing too frequently slows down training.
4. During **distributed training**: every Trainer saves its own parameters under checkpoint_dir (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data under the same checkpoint_dir into a complete copy, and recovery must use the complete data.

## Future Plans
1. Support saving parameters via etcd.
\ No newline at end of file
From 5d07c45d516809d453ba93009690fdd7ac60a450 Mon Sep 17 00:00:00 2001
From: tangwei12
Date: Tue, 26 Jun 2018 11:17:38 +0800
Subject: [PATCH 2/7] checkpoint doc

---
 .../howto/training/checkpoint_doc_en.md | 59 +++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100644 source/user_guides/howto/training/checkpoint_doc_en.md

diff --git a/source/user_guides/howto/training/checkpoint_doc_en.md b/source/user_guides/howto/training/checkpoint_doc_en.md
new file mode 100644
index 00000000000..896affc146b
--- /dev/null
+++ b/source/user_guides/howto/training/checkpoint_doc_en.md

# Checkpoint User Guide
## Background
In many cases, stand-alone training and distributed training can be aborted by a software or hardware problem. Worse, the time and machine capacity already spent are then wasted, and training has to be restarted from scratch.
## Purpose
The checkpoint feature can save intermediate model variables, lookup table variables, and other needed data in a checkpoint directory. When an exception occurs, these variables can be loaded from the checkpoint directory immediately and training can resume.
## Introduction
### Features completed so far:
1. Trainer 0 saves the model variables during training.
2. Each Trainer saves the arguments it needs.
3. Each Parameter Server saves the Distribute Lookup Table variables during training.
### Fluid checkpoint directory structure:
checkpoint_dir (the checkpoint directory the user defines)
├── checkpoint_0 (the first save directory)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (each trainer saves its own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (the second save directory)

## Usage
### Constructing Fluid.CheckpointConfig
To use the checkpoint feature, the main thing the user has to do is declare and construct Fluid.CheckpointConfig.

CheckpointConfig has 4 member variables that need to be initialized:
```table
Member Variable | Type | Comment
checkpoint_dir | str | checkpoint directory
max_num_checkpoints | int | maximum number of checkpoint copies to keep
epoch_interval | int | save a checkpoint every epoch_interval epochs
step_interval | int | save a checkpoint every step_interval steps
```
### Add the Fluid.CheckpointConfig declaration to Fluid.Trainer
Because the initialization of Trainer needs an instance of CheckpointConfig, we should declare Fluid.CheckpointConfig first.

For example:
```python
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)
trainer = Trainer(..., checkpoint_config=config)
```

Once this is done, training saves a checkpoint at the specified epochs and steps. If training is aborted, the user can simply restart it and it will restore from the latest copy; a minimal end-to-end sketch is given at the end of this guide.

## Related API
https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
## Attention
1. Make sure the ```checkpoint_dir``` is used by only one training job.
2. Adjust ```max_num_checkpoints``` to the disk size and the model size.
3. Do not set ```epoch_interval``` and ```step_interval``` too small; checkpointing too frequently slows down training.
4. **In distributed training**, each Trainer saves its arguments in its ```checkpoint_dir``` (only Trainer 0 saves the model variables). We need a **distributed file system (HDFS, etc.)** to merge all the ```checkpoint_dir``` contents to get the whole data.

## Next Plan
1. Save and restore checkpoints via etcd.
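## Example
For reference, the snippet below is a minimal end-to-end sketch of the flow described above. The linear-regression network, the uci_housing reader, and the names `train_program`, `optimizer_func`, and `event_handler` are illustrative stand-ins, and the exact `Trainer`/`train` keyword arguments may differ between Fluid versions; treat this as a sketch of where `checkpoint_config` fits rather than a definitive recipe.

```python
import paddle
import paddle.fluid as fluid


def train_program():
    # A toy linear-regression network, only to make the sketch concrete.
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1)
    return fluid.layers.mean(
        fluid.layers.square_error_cost(input=y_predict, label=y))


def optimizer_func():
    return fluid.optimizer.SGD(learning_rate=0.01)


def event_handler(event):
    # Optional: report progress at the end of every step.
    if isinstance(event, fluid.EndStepEvent):
        print("step %d finished" % event.step)


# The checkpoint configuration described above: keep at most 2 copies,
# save every 2 epochs and every 10 steps.
config = fluid.CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)

trainer = fluid.Trainer(
    train_func=train_program,
    optimizer_func=optimizer_func,
    place=fluid.CPUPlace(),
    checkpoint_config=config)

# If /tmp/ckpt already holds checkpoints from an aborted run, the trainer
# restores from the latest copy before continuing to train.
trainer.train(
    reader=paddle.batch(paddle.dataset.uci_housing.train(), batch_size=20),
    num_epochs=10,
    event_handler=event_handler,
    feed_order=['x', 'y'])
```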
\ No newline at end of file
From 138401b7de15338c75de5f295b6c6f5397365df4 Mon Sep 17 00:00:00 2001
From: tangwei12
Date: Wed, 27 Jun 2018 15:09:31 +0800
Subject: [PATCH 3/7] update doc to reStructuredText

---
 .../howto/training/checkpoint_doc_cn.md | 68 +++++++++++--------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/source/user_guides/howto/training/checkpoint_doc_cn.md b/source/user_guides/howto/training/checkpoint_doc_cn.md
index 6ab59e09e01..dfaca42b36c 100644
--- a/source/user_guides/howto/training/checkpoint_doc_cn.md
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md

######################
Checkpoint User Guide
######################

Background
----------

During single-machine or multi-machine training, software or hardware failures can interrupt the job. The run then produces no result, or an unusable one, wasting a large amount of time and machine capacity.

Purpose
-------
The checkpoint feature saves intermediate training state to disk while training runs. When training is interrupted by a failure, the saved state can be loaded and training resumed, which provides fault-tolerant training on both single and multiple machines.

Notes
-----
- Parameter saving implemented so far:

1. Trainer 0 saves the model parameters during training.
2. Each PServer saves the parameters related to the Distribute Lookup Table.

- Layout of the Fluid checkpoint directory:

checkpoint_dir (user-defined checkpoint directory)
├── checkpoint_0 (first save)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (the trainer's own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (second save)

Usage
-----
1. Declare Fluid.CheckpointConfig

The checkpoint feature is configured through the Fluid.CheckpointConfig object.
CheckpointConfig takes 4 parameters:

====================  ======  ==============================================
Parameter             Type    Description
====================  ======  ==============================================
checkpoint_dir        str     directory in which checkpoints are stored
max_num_checkpoints   int     maximum number of checkpoint copies to keep
epoch_interval        int     save a checkpoint every epoch_interval epochs
step_interval         int     save a checkpoint every step_interval steps
====================  ======  ==============================================

2. Pass Fluid.CheckpointConfig when declaring the Fluid.Trainer object

The parameters of Trainer's __init__ method include a CheckpointConfig; pass in a CheckpointConfig object that was declared before the Trainer.
For example:

.. code-block:: python

    config = CheckpointConfig(
        checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
        epoch_interval=2, step_interval=10)
    trainer = Trainer(..., checkpoint_config=config)

Once this is defined and declared, training saves checkpoints at the specified steps and epochs, and after a failure the parameters are restored automatically from the latest checkpoint directory.

Related API
-----------
https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97

Caveats
-------
1. Make sure each training job's ``checkpoint_dir`` is independent of every other job's.
2. Tune the maximum number of copies, ``max_num_checkpoints``, to the disk capacity and the model size, so the disk stays usable.
3. Do not set ``epoch_interval`` and ``step_interval`` too small; checkpointing too frequently slows down training.
4. During **distributed training**: every Trainer saves its own parameters under ``checkpoint_dir`` (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data under the same ``checkpoint_dir`` into a complete copy, and recovery must use the complete data.
\ No newline at end of file
From 238348ef8041900ecfd662f3db3bf510b15213d7 Mon Sep 17 00:00:00 2001
From: tangwei12
Date: Wed, 27 Jun 2018 16:27:32 +0800
Subject: [PATCH 4/7] Revert "update doc to reStructuredText"

This reverts commit 138401b7de15338c75de5f295b6c6f5397365df4.
---
 .../howto/training/checkpoint_doc_cn.md | 68 ++++++++----------
 1 file changed, 27 insertions(+), 41 deletions(-)

diff --git a/source/user_guides/howto/training/checkpoint_doc_cn.md b/source/user_guides/howto/training/checkpoint_doc_cn.md
index dfaca42b36c..6ab59e09e01 100644
--- a/source/user_guides/howto/training/checkpoint_doc_cn.md
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md

# Checkpoint User Guide
## Background
During single-machine or multi-machine training, software or hardware failures can interrupt the job. The run then produces no result, or an unusable one, wasting a large amount of time and machine capacity.

## Purpose
The checkpoint feature saves intermediate training state to disk while training runs. When training is interrupted by a failure, the saved state can be loaded and training resumed, which provides fault-tolerant training on both single and multiple machines.

## Notes
### Parameter saving implemented so far:
1. Trainer 0 saves the model parameters during training.
2. Each PServer saves the parameters related to the Distribute Lookup Table.
### Layout of the Fluid checkpoint directory:
checkpoint_dir (user-defined checkpoint directory)
├── checkpoint_0 (first save)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (the trainer's own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (second save)

## Usage
### Declare Fluid.CheckpointConfig
The checkpoint feature is configured through the Fluid.CheckpointConfig object.
CheckpointConfig takes 4 parameters:
```table
Parameter | Type | Description
checkpoint_dir | str | directory in which checkpoints are stored
max_num_checkpoints | int | maximum number of checkpoint copies to keep
epoch_interval | int | save a checkpoint every epoch_interval epochs
step_interval | int | save a checkpoint every step_interval steps
```
### Pass Fluid.CheckpointConfig when declaring the Fluid.Trainer object
The parameters of Trainer's `__init__` method include a CheckpointConfig; pass in a CheckpointConfig object that was declared before the Trainer.
For example:
```python
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)
trainer = Trainer(..., checkpoint_config=config)
```
Once this is defined and declared, training saves checkpoints at the specified steps and epochs, and after a failure the parameters are restored automatically from the latest checkpoint directory.

## Related API
https://github.com/PaddlePaddle/Paddle/blob/3ff9ba0e6ba1eec282b6e89fb7bea2e2046f01c5/python/paddle/fluid/trainer.py#L97
## Caveats
1. Make sure each training job's ```checkpoint_dir``` is independent of every other job's.
2. Tune the maximum number of copies, max_num_checkpoints, to the disk capacity and the model size, so the disk stays usable.
3. Do not set epoch_interval and step_interval too small; checkpointing too frequently slows down training.
4. During **distributed training**: every Trainer saves its own parameters under checkpoint_dir (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data under the same checkpoint_dir into a complete copy, and recovery must use the complete data.

## Future Plans
1. Support saving parameters via etcd.
\ No newline at end of file
From 0b4ab9c529dccf8d286fb5224e1d3e5bec79a7ed Mon Sep 17 00:00:00 2001
From: tangwei12
Date: Wed, 27 Jun 2018 17:24:32 +0800
Subject: [PATCH 5/7] Doc format and content optimization
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../howto/training/checkpoint_doc_cn.md | 40 +++++++++--------
 .../howto/training/checkpoint_doc_en.md | 43 ++++++++++---------
 2 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/source/user_guides/howto/training/checkpoint_doc_cn.md b/source/user_guides/howto/training/checkpoint_doc_cn.md
index 6ab59e09e01..8ba67bcaeb7 100644
--- a/source/user_guides/howto/training/checkpoint_doc_cn.md
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md

# Checkpoint User Guide

## Background
During single-machine or multi-machine training, software or hardware failures can interrupt the job. The run then produces no result, or an unusable one, wasting a large amount of time and machine capacity.

## Purpose
The checkpoint feature saves intermediate training state to disk while training runs. When training is interrupted by a failure, the saved state can be loaded and training resumed, which provides fault-tolerant training on both single and multiple machines.

## Notes
### Parameter saving implemented so far:
1. Trainer 0 saves the model parameters during training.
2. Each PServer saves the parameters related to the ```Distribute Lookup Table```.
### Layout of the Fluid checkpoint directory:

```
checkpoint_dir (user-defined checkpoint directory)
├── checkpoint_0 (first save)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (the trainer's own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (second save)
```

## Usage
### Declare Fluid.CheckpointConfig
The checkpoint feature is configured through the ```CheckpointConfig``` object in ```Fluid```.

```CheckpointConfig``` takes 4 parameters:

| Parameter | Type | Description |
| - | :-: | - |
| checkpoint_dir | str | directory in which checkpoints are stored |
| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
| epoch_interval | int | save a checkpoint every epoch_interval epochs |
| step_interval | int | save a checkpoint every step_interval steps |

### Pass Fluid.CheckpointConfig when declaring the Fluid.Trainer object
The parameters of Trainer's `__init__` method include a ```CheckpointConfig```; pass in a ```CheckpointConfig``` object that was declared before the Trainer.
For example:
```python
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)
trainer = Trainer(..., checkpoint_config=config)
```
Once this is defined and declared, training saves checkpoints at the specified steps and epochs, and after a failure the parameters are restored automatically from the latest checkpoint directory.

## Related API
Trainer API documentation:

## Caveats
1. Make sure each training job's ```checkpoint_dir``` is independent of every other job's.
2. Tune the maximum number of copies, ```max_num_checkpoints```, to the disk capacity and the model size, so the disk stays usable.
3. Do not set ```epoch_interval``` and ```step_interval``` too small; checkpointing too frequently slows down training.
4. During **distributed training**: every Trainer saves its own parameters under ```checkpoint_dir``` (only Trainer 0 saves the model parameters). A **distributed file system (HDFS, etc.)** is needed to merge the data under the same ```checkpoint_dir``` into a complete copy, and recovery must use the complete data.
\ No newline at end of file

diff --git a/source/user_guides/howto/training/checkpoint_doc_en.md b/source/user_guides/howto/training/checkpoint_doc_en.md
index 896affc146b..1561d1b1ccc 100644
--- a/source/user_guides/howto/training/checkpoint_doc_en.md
+++ b/source/user_guides/howto/training/checkpoint_doc_en.md

# Checkpoint User Guide

## Background
In many cases, stand-alone training and distributed training can be aborted by a software or hardware problem. Worse, the time and machine capacity already spent are then wasted, and training has to be restarted from scratch.

## Purpose
The ```Checkpoint``` feature can save intermediate model variables, lookup table variables, and other needed data in a checkpoint directory. When an exception occurs, these variables can be loaded from the checkpoint directory immediately and training can resume.
## Introduction
### Features completed so far:
1. Trainer 0 saves the model variables during training.
2. Each Trainer saves the arguments it needs.
3. Each Parameter Server saves the ```Distribute Lookup Table``` variables during training.
### Fluid checkpoint directory structure:

```
checkpoint_dir (the checkpoint directory the user defines)
├── checkpoint_0 (the first save directory)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (each trainer saves its own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (the second save directory)
```

## Usage
### Constructing Fluid.CheckpointConfig
When the user wants to use the ```Checkpoint``` feature, the main thing to do is declare ```CheckpointConfig``` and construct it.

```CheckpointConfig``` has 4 member variables that need to be initialized:

| Member Variable | Type | Comment |
| - | :-: | - |
| checkpoint_dir | str | checkpoint directory |
| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
| epoch_interval | int | save a checkpoint every epoch_interval epochs |
| step_interval | int | save a checkpoint every step_interval steps |

### Add Fluid.CheckpointConfig's declaration in Fluid.Trainer
Because the initialization of Trainer needs an instance of ```CheckpointConfig```, we should declare ```CheckpointConfig``` in ```Fluid``` first.
For example:
```python
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)
trainer = Trainer(..., checkpoint_config=config)
```

Once this is done, training saves a checkpoint at the specified epochs and steps. If training is aborted, the user can simply restart it and it will restore from the latest copy.

## Related API
Related Trainer API:

## Attention
1. Make sure the ```checkpoint_dir``` is used by only one training job.
2. Adjust ```max_num_checkpoints``` to the disk size and the model size.
3. Checkpointing too frequently slows down training, so do not set ```epoch_interval``` and ```step_interval``` too small.
4. **In distributed training**, each Trainer saves its arguments in its ```checkpoint_dir``` (only Trainer 0 saves the model variables). We need a **distributed file system (HDFS, etc.)** to merge all the ```checkpoint_dir``` contents to get the whole data.
\ No newline at end of file
From db5cfcb633d7dbe9d042a67a79a018a41fa5f7f3 Mon Sep 17 00:00:00 2001
From: tangwei12
Date: Wed, 27 Jun 2018 19:50:20 +0800
Subject: [PATCH 6/7] Doc format and content optimization
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../howto/training/checkpoint_doc_cn.md |  2 +-
 .../howto/training/checkpoint_doc_en.md | 20 +++++++++----------
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/source/user_guides/howto/training/checkpoint_doc_cn.md b/source/user_guides/howto/training/checkpoint_doc_cn.md
index 8ba67bcaeb7..51e07683f34 100644
--- a/source/user_guides/howto/training/checkpoint_doc_cn.md
+++ b/source/user_guides/howto/training/checkpoint_doc_cn.md

 trainer = Trainer(..., checkpoint_config=config)
 ```
 Once this is defined and declared, training saves checkpoints at the specified steps and epochs, and after a failure the parameters are restored automatically from the latest checkpoint directory.

 ## Related API
-Trainer API documentation:
+[Trainer API documentation](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)

diff --git a/source/user_guides/howto/training/checkpoint_doc_en.md b/source/user_guides/howto/training/checkpoint_doc_en.md
index 1561d1b1ccc..60524f64016 100644
--- a/source/user_guides/howto/training/checkpoint_doc_en.md
+++ b/source/user_guides/howto/training/checkpoint_doc_en.md

# Checkpoint User Guide

## Background
In many cases, stand-alone training and distributed training can be aborted by a software or hardware problem. Worse, the time and machine capacity already spent are then wasted, and training has to be restarted from scratch.
## Purpose
The ```Checkpoint``` feature can save intermediate model variables, lookup table variables, and other needed data in a checkpoint directory. When an exception occurs, these variables can be loaded from the checkpoint directory immediately and training can resume.
## Introduction
### Features completed so far:
1. Trainer 0 saves the model variables during training.
2. Each Trainer saves the arguments it needs.
3. Each Parameter Server saves the ```Distribute Lookup Table``` variables during training.
### Fluid checkpoint directory structure:

```
checkpoint_dir (the checkpoint directory the user defines)
├── checkpoint_0 (the first save directory)
│   ├── __lockup_table__ (Distribute Lookup Table directory)
│   │   ├── table_pserver_0 (lookup table data saved by PServer 0)
│   │   └── table_pserver_1
│   ├── __model__ (model directory)
│   │   └── var.w_1
│   └── trainer_0 (each trainer saves its own data)
│       ├── epoch_id
│       └── step_id
└── checkpoint_1 (the second save directory)
```

(A sketch of how the latest copy can be located under this layout is given in the appendix at the end of this guide.)

## Usage
### Constructing Fluid.CheckpointConfig
When the user wants to use the ```Checkpoint``` feature, the main thing to do is declare ```CheckpointConfig``` and construct it.

```CheckpointConfig``` has 4 member variables that need to be initialized:

| Member Variable | Type | Comment |
| - | :-: | - |
| checkpoint_dir | str | checkpoint directory |
| max_num_checkpoints | int | maximum number of checkpoint copies to keep |
| epoch_interval | int | save a checkpoint every epoch_interval epochs |
| step_interval | int | save a checkpoint every step_interval steps |

### Add Fluid.CheckpointConfig's declaration in Fluid.Trainer
Because the initialization of Trainer needs an instance of ```CheckpointConfig```, we should declare ```CheckpointConfig``` in ```Fluid``` first.

For example:
```python
config = CheckpointConfig(
    checkpoint_dir="/tmp/ckpt", max_num_checkpoints=2,
    epoch_interval=2, step_interval=10)
trainer = Trainer(..., checkpoint_config=config)
```

Once this is done, training saves a checkpoint at the specified epochs and steps. If training is aborted, the user can simply restart it and it will restore from the latest copy.

## Related API
[Related Trainer API](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/trainer.py)

## Attention
1. Make sure the ```checkpoint_dir``` is used by only one training job.
2. Adjust ```max_num_checkpoints``` to the disk size and the model size.
3. Checkpointing too frequently slows down training, so do not set ```epoch_interval``` and ```step_interval``` too small.
4. **In distributed training**, each Trainer saves its arguments in its ```checkpoint_dir``` (only Trainer 0 saves the model variables). We need a **distributed file system (HDFS, etc.)** to merge all the ```checkpoint_dir``` contents to get the whole data.
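## Appendix: locating the latest checkpoint
As an illustration of the directory layout above, the hypothetical helper below scans a ```checkpoint_dir``` for ```checkpoint_N``` sub-directories and returns the newest one, which is the copy a restarted job restores from. Fluid performs the equivalent lookup internally, so user code does not need this; it is only a sketch of the naming convention.

```python
import os
import re


def latest_checkpoint(checkpoint_dir):
    """Return the path of the newest checkpoint_N sub-directory, or None.

    Mirrors the layout described above: each save creates checkpoint_0,
    checkpoint_1, ... under the user-defined checkpoint directory.
    """
    if not os.path.isdir(checkpoint_dir):
        return None
    pattern = re.compile(r'^checkpoint_(\d+)$')
    serials = []
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(checkpoint_dir, name)):
            serials.append(int(match.group(1)))
    if not serials:
        return None
    return os.path.join(checkpoint_dir, "checkpoint_%d" % max(serials))


# With the layout above, after two saves this prints "/tmp/ckpt/checkpoint_1".
print(latest_checkpoint("/tmp/ckpt"))
```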
\ No newline at end of file
From 9f858c424311cadce2e1644620da2e5ff7bfbef6 Mon Sep 17 00:00:00 2001
From: yuyang18
Date: Thu, 28 Jun 2018 15:40:40 +0800
Subject: [PATCH 7/7] Include toc

---
 source/user_guides/howto/index.rst                        | 2 +-
 source/user_guides/howto/training/index.rst               | 1 +
 source/user_guides/howto/training/multi_node.rst          | 3 ++-
 source/user_guides/howto/training/save_load_variables.rst | 8 ++++++++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/source/user_guides/howto/index.rst b/source/user_guides/howto/index.rst
index ab74722d4b0..c945054fd1c 100644
--- a/source/user_guides/howto/index.rst
+++ b/source/user_guides/howto/index.rst

 ####################

 .. toctree::
    :maxdepth: 2

    prepare_data/index

diff --git a/source/user_guides/howto/training/index.rst b/source/user_guides/howto/training/index.rst
index a56dc986fb8..68475101e26 100644
--- a/source/user_guides/howto/training/index.rst
+++ b/source/user_guides/howto/training/index.rst

 PaddlePaddle Fluid supports single-node training and multi-node training. Each training mode has

    single_node
    multi_node
+   save_load_variables

diff --git a/source/user_guides/howto/training/multi_node.rst b/source/user_guides/howto/training/multi_node.rst
index a6fee76533c..2e5092675ca 100644
--- a/source/user_guides/howto/training/multi_node.rst
+++ b/source/user_guides/howto/training/multi_node.rst

 .. toctree::
    :maxdepth: 2

-   cluster_quick_start.md
\ No newline at end of file
+   cluster_quick_start.rst
+   checkpoint_doc_cn.md

diff --git a/source/user_guides/howto/training/save_load_variables.rst b/source/user_guides/howto/training/save_load_variables.rst
index 2be0146a23f..7d602312473 100644
--- a/source/user_guides/howto/training/save_load_variables.rst
+++ b/source/user_guides/howto/training/save_load_variables.rst

 must be exactly the same; otherwise the variables may be loaded incorrectly or not loaded at all. In addition, as with :code:`fluid.io.save_params`, running :code:`fluid.default_startup_program()` must also happen before :code:`fluid.io.load_checkpoint`.
+
+Multi-node checkpoint saving
+############################
+
+.. toctree::
+   :maxdepth: 2
+
+   checkpoint_doc_cn.md
\ No newline at end of file