issue with training end, not loading all steps #138

@Vikas-kum

Description

When training ends, the trial downloads all the index files, but the rule invocation ends prematurely; see the logs below.
The training ran for 90 steps, but the rule concluded at step 60.

[2020-01-08 22:57:14.339 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO trial.py:197] Training has ended, will refresh one final time in 1 sec.
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 DEBUG index_reader.py:310] Loaded Index Files: upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000070_worker_0.json,upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000080_worker_0.json,upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000090_worker_0.json
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO trial.py:209] Loaded all steps
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 DEBUG trial.py:211] Training Has Ended : last_complete_step was: 60
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 DEBUG trial.py:213] Training Has Ended : last_index_token was: upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198/index/000000000/000000000060_worker_0.json
[2020-01-08 22:57:15.361 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO invoker.py:36] Looking for step 61 of mode GLOBAL and reached end of training. Max step available is 60
[2020-01-08 22:57:15.362 /codebuild/output/src046/src/github.com/awslabs/sagemaker-debugger-rules/tests/analysis/invoker.py_s3://smdebugcodebuildtest/upload/20200108_223713/a78b5eb/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578524198.2377198 INFO invoker.py:40] Ending execution of rule LossNotDecreasing with step=60

Detailed logs for the run: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/codebuild/smdebug_tensorflow_zero_code_change_build;stream=43eabe45-3c36-41a3-977a-592035cbd404;filter=trial_loss_not_decreasing_tf_true_parallel_mode
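A minimal sketch of the behavior the logs suggest: the final refresh loads the index files for steps 70-90 ("Loaded all steps"), but `last_complete_step` stays at 60, so the invoker loop stops before evaluating the newly loaded steps. All names here (`Trial`, `invoke_rule`, `last_complete_step`, `loaded_steps`) mirror the log messages but are assumptions, not the real smdebug implementation:

```python
STEP_INTERVAL = 10

class Trial:
    """Toy stand-in for the trial object; names are assumptions."""
    def __init__(self):
        self.loaded_steps = list(range(0, 70, STEP_INTERVAL))  # steps 0..60
        self.last_complete_step = 60
        self.training_ended = False

    def refresh_after_training_end(self):
        # The final refresh does load the remaining index files...
        self.loaded_steps += [70, 80, 90]
        self.training_ended = True
        # ...but last_complete_step is never advanced past 60 (the suspected bug).

def invoke_rule(trial):
    """Mimics the invoker loop: stop once the requested step exceeds
    last_complete_step after training has ended."""
    evaluated = []
    step = 0
    while not (trial.training_ended and step > trial.last_complete_step):
        if step in trial.loaded_steps:
            evaluated.append(step)
        step += STEP_INTERVAL
    return evaluated

trial = Trial()
trial.refresh_after_training_end()
result = invoke_rule(trial)
print(result)  # steps 70-90 are loaded but never evaluated
```

If this model is right, the fix would be to advance `last_complete_step` during the final refresh, before the invoker performs its end-of-training check.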
