translate Cluster Training and Prediction #9356
Conversation
doc/v2/faq/cluster/index_en.rst
Outdated
1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`.
This sentence has only a subject, but no predicate and no object.
doc/v2/faq/cluster/index_en.rst
Outdated
* Find the first error in the :code:`train.log`, :code:`server.log`, check whether other fault casued the problem, such as FPE, lacking of memory or disk.

* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
"If network connection gave rise to the first error in the log" => If the first error in server.log
says "Address already used"
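As a side note for readers of this FAQ entry (not part of the reviewed patch): a minimal Python sketch of the "find the first error in :code:`train.log`, :code:`server.log`" step discussed above. The error keywords and the assumption that the logs sit in the current working directory are illustrative choices, not something the doc specifies.

```python
# Illustrative sketch only: scan the logs mentioned in the FAQ for the first
# line that looks like an error. The keyword list below is an assumption;
# adjust it to the actual log format of the cluster.
import re
from pathlib import Path
from typing import Optional

ERROR_PATTERN = re.compile(
    r"Connection reset by peer|Address already in use|ERROR|FATAL",
    re.IGNORECASE,
)

def first_error(log_path: str) -> Optional[str]:
    """Return the first line in the log that matches ERROR_PATTERN, or None."""
    path = Path(log_path)
    if not path.exists():
        return None
    with path.open(errors="replace") as f:
        for line in f:
            if ERROR_PATTERN.search(line):
                return line.rstrip()
    return None

# Report the first suspicious line in each log, if any.
for log in ("train.log", "server.log"):
    print(log, "->", first_error(log))
```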
doc/v2/faq/cluster/index_en.rst
Outdated
* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
Connect with the operator => Contact the sys-admin
doc/v2/faq/cluster/index_en.rst
Outdated
* If network connection gave rise to the first error in the log, this may be caused by the port conflict of the non-exclusive execution. Connect with the operator to check if the current MPI cluster supports jobs submitted with parameter :code:`resource=full`. If so, change the port of job.
If so, change the port of job. => If the current MPI cluster does not support this parameter, change the server port and try again.
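Likewise only an illustrative sketch, not something from the patch: when the advice is to change the server port and try again, one generic way to pick a port that is currently free on the node is to let the OS choose one. How the chosen port is then passed to the parameter server depends on the cluster setup and is deliberately not shown here.

```python
# Illustrative sketch only: ask the OS for an unused TCP port, which can then
# be used when relaunching the parameter server after a port conflict.
import socket

def find_free_port() -> int:
    """Bind to port 0 so the kernel picks any free TCP port, and return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the kernel to choose a free port
        return s.getsockname()[1]

new_port = find_free_port()
print("suggested new parameter-server port:", new_port)
```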
doc/v2/faq/cluster/index_en.rst
Outdated
* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster.
people may want to know what the "exclusive pattern" is.
operator => cluster administrator
people may want to know what the "exclusive pattern" is.
Which doc should we refer this term to?
doc/v2/faq/cluster/index_en.rst
Outdated
TBD
.. contents::

1. Network connection errors in the log during muliti-node cluster training
muliti-node -> multi-node
doc/v2/faq/cluster/index_en.rst
Outdated
.. contents::

1. Network connection errors in the log during muliti-node cluster training
------------------------------------------------
mulilti-node -> multi-node
doc/v2/faq/cluster/index_en.rst
Outdated
The errors in the log belong to network connection during mulilti-node cluster training, for example, :code:`Connection reset by peer`.
This kind of error is usually caused by the abnormal exit of the training process in some node, and the others cannot connect with this node any longer. Steps to troubleshoot the problem as follows:
of the training process -> of a training process
others -> other nodes
the problem as follows -> the problem are as follows
doc/v2/faq/cluster/index_en.rst
Outdated
* If the currnet MPI cluster does not support exclusive pattern, ask the operator to replace or update the current cluster.
currnet -> current
done
LGTM
fix #8954