
Update faq/local/index_en.rst #9947

Merged: 2 commits into PaddlePaddle:develop on Apr 23, 2018

Conversation

@jamesbing (Contributor) commented on Apr 16, 2018

translation version 1.0
fix #8953

@CLAassistant commented on Apr 16, 2018

CLA assistant check
All committers have signed the CLA.

@shanyi15 changed the title from "Update index_en.rst" to "Update faq/local/index_en.rst" on Apr 16, 2018
TBD
.. contents::

1. Reduce Memory Consuming
Review comment (Contributor): Consuming -> Consumption

1. Reduce Memory Consuming
-------------------

The training procedure of neural networks demands dozens gigabytes of host memory or serval gigabytes of device memory, which is a rather memory consuming work. The memory consumed by PaddlePaddle framework mainly includes:
Review comment (Contributor): dozens gigabytes -> dozens of gigabytes

Reduce DataProvider cache memory
++++++++++++++++++++++++++

PyDataProvider works under asynchronously mechanism, it loads together with the data fetch and shuffle procedure in host memory:
Review comment (Contributor): asynchronously -> asynchronous

Data Files -> Host Memory Pool -> PaddlePaddle Training

Thus the reduction of the DataProvider cache memory can reduce memory occupancy, meanwhile speed up the data loading procedure before training. However, the size of the memory pool can actually effect the granularity of shuffle,which means a shuffle operation is needed before each data file reading process to ensure the randomness of data when try to reduce the size of the memory pool.
Review comment (Contributor): effect -> affect


.. literalinclude:: src/reduce_min_pool_size.py

In such way, the memory consuming can be significantly reduced and hence the training procedure can be accelerated. More details are demonstrated in :ref:`api_pydataprovider2`.
Review comment (Contributor): In such way -> In this way

Review comment (Contributor): memory consuming -> memory consumption
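The literalinclude above points at src/reduce_min_pool_size.py, which is not reproduced in this page. As a rough, hedged sketch of the technique being discussed (assuming the legacy paddle.trainer.PyDataProvider2 API; the input types and the line parser below are made up for illustration), a provider can shrink the host memory pool and shuffle the data file itself beforehand:

.. code-block:: python

    import os
    from paddle.trainer.PyDataProvider2 import provider, dense_vector, integer_value

    def get_sample_from_line(line):
        # Hypothetical parser: ten float features followed by an integer label.
        fields = line.split()
        return {'x': [float(v) for v in fields[:10]], 'y': int(fields[10])}

    @provider(
        input_types={'x': dense_vector(10), 'y': integer_value(2)},
        min_pool_size=0)  # keep the host memory pool as small as possible
    def process(settings, filename):
        # With a tiny pool the framework can no longer shuffle across a large
        # cache, so shuffle the data file itself before each reading pass.
        os.system('shuf %s > %s.shuf' % (filename, filename))
        with open('%s.shuf' % filename) as f:
            for line in f:
                yield get_sample_from_line(line)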


* Parameters or gradients during training are oversize, which leads to floating overflow during calculation.
* The model failed to convergence and divert to a big value.
* Errors in training data leads to parameters converge to a singularity situation. This may also due to the large scale of input data, which contains millions of parameter values, and that will raise float overflow when operating matrix multiplication.
Review comment (Contributor): "Errors in training data leads to parameters converge to a singularity situation." This sentence does not make any sense. What are you trying to say here?

Review comment (Contributor): also due to -> also be due to


Details can refer to example `machine translation <https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/train.py#L66>`_ 。

The main difference of these two methods are:
Review comment (Contributor): of these two -> between these two


The main difference of these two methods are:

1. They both block the gradient, but within different occasion,the former one happens when then :code:`optimzier` updates the network parameters while the latter happens when the back propagation computing of activation functions.
Review comment (Contributor): "but within different occasion". What does this mean?
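For reference, a minimal sketch of the two clipping methods discussed above, assuming the legacy paddle.v2 API; the thresholds, layer sizes, and the input layer are placeholder values, not the exact code from the linked machine translation example:

.. code-block:: python

    import paddle.v2 as paddle

    paddle.init(use_gpu=False, trainer_count=1)
    data = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(128))

    # Method 1: clip parameter gradients when the optimizer applies the update.
    optimizer = paddle.optimizer.Adam(
        learning_rate=5e-4,
        gradient_clipping_threshold=10.0)

    # Method 2: clip the error (the gradient flowing through the activation)
    # of a single layer during back propagation.
    hidden = paddle.layer.fc(
        input=data,
        size=512,
        act=paddle.activation.Relu(),
        layer_attr=paddle.attr.ExtraLayerAttribute(error_clipping_threshold=100.0))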

* Output sequence layer and non sequence layer;
* Multiple output layers process multiple sequence with different length;

Such issue can be avoid by calling infer interface and set :code:`flatten_result=False`. Thus, the infer interface returns a python list, in which
Review comment (Contributor): avoid -> avoided
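A hedged sketch of the flatten_result suggestion above, again assuming the legacy paddle.v2 inference API; seq_out, non_seq_out, trained_parameters, and test_batch are placeholders for the user's own output layers, trained parameters, and input data, and the exact way flatten_result is passed may differ between versions:

.. code-block:: python

    import paddle.v2 as paddle

    # Build the inference object over both output layers, then request the
    # non-flattened results so outputs of different lengths stay separate.
    inferer = paddle.inference.Inference(
        output_layer=[seq_out, non_seq_out],   # placeholder output layers
        parameters=trained_parameters)         # placeholder trained parameters
    results = inferer.infer(input=test_batch, flatten_result=False)

    # results is now a Python list with one entry per output layer/field.
    for item in results:
        print(item)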

7. Fetch parameters’ weight and gradient during training
-----------------------------------------------

Under certain situations, know the weights of currently training mini-batch can provide more inceptions of many problems. Their value can be acquired by printing values in :code:`event_handler` (note that to gain such parameters when training on GPU, you should set :code:`paddle.event.EndForwardBackward`). Detailed code is as following:
Review comment (Contributor): know -> knowing
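A minimal sketch of such an event handler, assuming the legacy paddle.v2 trainer API; parameters is assumed to be the paddle.parameters.Parameters object the trainer was built with, and the printing interval of 25 batches is arbitrary:

.. code-block:: python

    import paddle.v2 as paddle

    def make_event_handler(parameters):
        def event_handler(event):
            # EndForwardBackward fires right after the backward pass of a
            # batch, which is when both weights and gradients can be read,
            # including when training on GPU (as the quoted text notes).
            if isinstance(event, paddle.event.EndForwardBackward):
                if event.batch_id % 25 == 0:
                    for name in parameters.keys():
                        print(name, parameters.get(name), parameters.get_grad(name))
        return event_handler

    # Usage sketch: pass it to the trainer, e.g.
    # trainer.train(reader=train_reader,
    #               event_handler=make_event_handler(parameters),
    #               num_passes=10)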

@abhinavarora merged commit d060a7f into PaddlePaddle:develop on Apr 23, 2018

Successfully merging this pull request may close these issues.

Translation Plan - Local Training and Inference - Chinese to English (#8953)
4 participants