enhance PythonPackage easyblock to make sure all test command output makes it to the EasyBuild log, also when return_output_ec=True #2770

Merged
merged 2 commits into from
Aug 6, 2022


casparvl
Contributor

@casparvl casparvl commented Aug 5, 2022

Currently, the full output of the test command is not written to the EasyBuild log. The output is only parsed for errors, and just the matching lines are stored, e.g.

== 2022-08-02 11:35:56,516 run.py:233 INFO running cmd: export PYTHONPATH=/scratch-shared/casparl/eb-dilodzki/tmp2w14ksbj/lib/python3.10/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /sw/arch/RHEL8/EB_production/2022/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error  --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
== 2022-08-02 15:55:46,938 run.py:676 INFO parse_log_for_error msg: Command used: export PYTHONPATH=/scratch-shared/casparl/eb-dilodzki/tmp2w14ksbj/lib/python3.10/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /sw/arch/RHEL8/EB_production/2022/software/Python/3.10.4-GCCcore-11.3.0/bin/python run_test.py --continue-through-error  --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
== 2022-08-02 15:55:46,983 run.py:678 INFO parse_log_for_error (some may be harmless) regExp (?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w) found:
[W tensorpipe_agent.cpp:726] RPC agent for worker3 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker3 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker2 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.SUMMON_FULL_PARAMS
FSDP throws appropriate error when we wrap multi-device module. ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 775257
ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.FORWARD
Test that an error is raised if we attempt to wrap when submodules are ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 786905
Test that an error is raised if we attempt to wrap when submodules are ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 786950
Test that an error is raised if we attempt to wrap when submodules are ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 787005
Test that an error is raised if we attempt to wrap when submodules are ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 787052

but that's not always very useful if you're missing the context. For PyTorch in particular, we'd want access to the full output in the EasyBuild log, and I imagine the same is true for any other Python package.
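To illustrate the idea, here is a minimal, standalone sketch of "log the complete output first, then filter for error lines". It is not the EasyBuild implementation: `run_and_log` is a hypothetical helper built on the standard `subprocess` and `logging` modules. The regular expression is copied from the log excerpt above; applying it case-insensitively is an assumption, based on the excerpt matching both `ERROR:` and `error`.

```python
import logging
import re
import subprocess
import sys

# Pattern copied from the log excerpt; the IGNORECASE flag is an assumption.
ERROR_REGEX = re.compile(
    r"(?<![(,-]|\w)(?:error|segmentation fault|failed)(?![(,-]|\.?\w)",
    re.IGNORECASE,
)


def run_and_log(cmd, log):
    """Hypothetical helper: run cmd, log its complete output, and return
    (output, exit_code, error_lines) instead of raising on failure."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, keeping the original order
        text=True,
    )
    output = proc.stdout

    # Write everything to the log first, so the context around each failure is kept.
    log.info("cmd %r exited with exit code %s and output:\n%s", cmd, proc.returncode, output)

    # Only afterwards pick out the lines that look like errors.
    error_lines = [line for line in output.splitlines() if ERROR_REGEX.search(line)]
    return output, proc.returncode, error_lines


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in for a test command that prints a passing and a failing test.
    demo_cmd = [sys.executable, "-c", "print('test_a ... ok'); print('ERROR: test_b'); raise SystemExit(1)"]
    _, exit_code, errors = run_and_log(demo_cmd, logging.getLogger("sketch"))
    print(exit_code, errors)
```

With this ordering, the caller can still decide what to do with the exit code and the filtered lines, while the log keeps the surrounding lines needed to interpret them.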

@boegel boegel changed the title from "Make sure all output makes it to the EasyBuild log, also when return_output_ec=True" to "enhance PythonPackage easyblock to make sure all test command output makes it to the EasyBuild log, also when return_output_ec=True" Aug 5, 2022
@boegel boegel added this to the next release (4.6.1?) milestone Aug 5, 2022
@boegel
Member

boegel commented Aug 5, 2022

To clarify: not including the command output in the log when return_output_ec is enabled is a regression that was introduced via #2742 (but only for PyTorch).

Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Aug 6, 2022

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 510.73.08, Python 3.6.8
See https://gist.github.com/c964651526b1beeec2417940edb9cc5d for a full test report.

@boegel boegel merged commit 8aa9219 into easybuilders:develop Aug 6, 2022