translate Cluster Training and Prediction #9356
Conversation
doc/v2/faq/cluster/index_en.rst
Outdated
1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`.
This sentence has only a subject, but no predicate and no object.
doc/v2/faq/cluster/index_en.rst
Outdated
* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk.

* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
"If network connection gave rise to the first error in the log" => If the first error in server.log
says "Address already used"
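As a side note for readers of this FAQ entry (not part of the reviewed patch): a minimal Python sketch of the "find the first error in :code:`train.log`, :code:`server.log`" step discussed above. The error keywords and the assumption that the logs sit in the current working directory are illustrative choices, not something the doc specifies.

```python
# Illustrative sketch only: scan the logs mentioned in the FAQ for the first
# line that looks like an error. The keyword list below is an assumption;
# adjust it to the actual log format of the cluster.
import re
from pathlib import Path
from typing import Optional

ERROR_PATTERN = re.compile(
    r"Connection reset by peer|Address already in use|ERROR|FATAL",
    re.IGNORECASE,
)

def first_error(log_path: str) -> Optional[str]:
    """Return the first line in the log that matches ERROR_PATTERN, or None."""
    path = Path(log_path)
    if not path.exists():
        return None
    with path.open(errors="replace") as f:
        for line in f:
            if ERROR_PATTERN.search(line):
                return line.rstrip()
    return None

# Report the first suspicious line in each log, if any.
for log in ("train.log", "server.log"):
    print(log, "->", first_error(log))
```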
doc/v2/faq/cluster/index_en.rst
Outdated
* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
Connect with the operator => Contact the sys-admin
doc/v2/faq/cluster/index_en.rst
Outdated
* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
If so, change the port of job. => If the current MPI cluster does not support this parameter, change the server port and try again.
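Likewise only an illustrative sketch, not something from the patch: when the advice is to change the server port and try again, one generic way to pick a port that is currently free on the node is to let the OS choose one. How the chosen port is then passed to the parameter server depends on the cluster setup and is deliberately not shown here.

```python
# Illustrative sketch only: ask the OS for an unused TCP port, which can then
# be used when relaunching the parameter server after a port conflict.
import socket

def find_free_port() -> int:
    """Bind to port 0 so the kernel picks any free TCP port, and return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the kernel to choose a free port
        return s.getsockname()[1]

new_port = find_free_port()
print("suggested new parameter-server port:", new_port)
```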
doc/v2/faq/cluster/index_en.rst
Outdated
* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster.
people may want to know what the "exclusive pattern" is.
operator => cluster administrator
people may want to know what the "exclusive pattern" is.
Which doc should we refer this term to?
doc/v2/faq/cluster/index_en.rst
Outdated
TBD
.. contents::

1. Network connection errors in the log during muliti-node cluster training
muliti-node -> multi-node
doc/v2/faq/cluster/index_en.rst
Outdated
.. contents::

1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
mulilti-node -> multi-node
doc/v2/faq/cluster/index_en.rst
Outdated
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`.
This kind of error is usually caused by the abnormal exit of the training process in some node, and the others cannot connect with this node any longer. Steps to troubleshoot the problem as follows:
of the training process -> of a training process
others -> other nodes
the problem as follows -> the problem are as follows
doc/v2/faq/cluster/index_en.rst
Outdated
* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster.
currnet -> current
done
LGTM
fix #8954