flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker #13484
Another one here. This flaky test seems to fail pretty frequently.
@mxnet-label-bot add [Flaky, Test, Python]
Guys, disabling the test is not the right thing to do. I added the MXNET_HOME variable (see docs/faq/env_var.md) to deal with these problems. The right fix is to make the Windows CI run set a different MXNET_HOME for each CI worker, so that different processes never access the same data directory concurrently.
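MXNET_HOME sets the directory where MXNet stores downloaded data (default `~/.mxnet`). Here is a minimal Python sketch of the per-worker idea. It assumes each CI worker already runs in its own working directory. The helper name and the `mxnet_home` folder name are hypothetical, and this is not the code from the actual fix.

```python
import os


def use_worker_local_mxnet_home(base_dir=None, name="mxnet_home"):
    """Point MXNET_HOME at a directory owned by this CI worker only.

    Assumption: every CI worker runs in its own working directory, so a
    folder under base_dir (default: the current working directory) is not
    shared with concurrently running processes. Call this before any test
    downloads data; child processes inherit the environment variable.
    """
    if base_dir is None:
        base_dir = os.getcwd()
    home = os.path.abspath(os.path.join(base_dir, name))
    os.makedirs(home, exist_ok=True)  # safe if the folder already exists
    os.environ["MXNET_HOME"] = home
    return home
```

Setting the variable from the CI script itself, before Python starts, achieves the same thing. The sketch just keeps the "one directory per worker" rule in one place.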
…denied due t… (apache#13531)
* Use MXNET_HOME in cwd in windows to prevent access denied due to concurrent data downloads. Fixes apache#13484
* Revert "Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (apache#13527)". This reverts commit 3d499cb.
Please reopen.
This looks more like an I/O stall than a bug.
120 seconds is a long time. Since everything happens on a local volume, it's unlikely that the disk is that busy. Could something be stuck? I think it's worth investigating.
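One way to answer "could something be stuck?" is to have Python dump every thread's stack if the test runs past a deadline. This sketch uses the standard-library `faulthandler` module. The helper name is hypothetical, and the 120-second default simply mirrors the figure above.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def dump_stacks_if_stuck(timeout_s=120.0, out=sys.stderr):
    """Dump every thread's traceback to `out` if the block outlives timeout_s.

    `out` must be a real file with a file descriptor, because faulthandler
    writes to the descriptor directly rather than through Python buffers.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out)
    try:
        yield
    finally:
        # Disarm the timer whether the block finished or raised.
        faulthandler.cancel_dump_traceback_later()
```

This only covers threads of the current process. A hang inside a separate DataLoader worker process would need the same hook installed in that worker.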
Any suggestions? Is it reproducible?
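To check reproducibility, you can call the test repeatedly in one process and tally how often, and with which exception, it fails. Here is a minimal, generic sketch. The `rerun` helper is hypothetical and not part of MXNet's test utilities.

```python
from collections import Counter


def rerun(test_fn, runs=100):
    """Call test_fn `runs` times and count failures by exception type name."""
    failures = Counter()
    for _ in range(runs):
        try:
            test_fn()
        except Exception as exc:  # a flaky test may fail with any error type
            failures[type(exc).__name__] += 1
    return failures
```

For example, import `test_recordimage_dataset_with_data_loader_multiworker` from the `test_gluon_data` module and call `rerun` on it with `runs=50`. The exact import path depends on how the test directory is laid out, which is an assumption here. An empty counter means no failures in that many runs.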
It may be a shared memory problem. Check shm usage by using … if you use Docker.
Why?
Any suggestions or a fix? The issue still persists.
Try monitoring your shared memory usage while the program runs. If the problem persists, try increasing the shared memory size.
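Multi-worker data loading typically passes batches between processes through shared memory. On Linux this is a tmpfs mounted at `/dev/shm`, and Docker gives containers only 64 MB there unless they are started with a larger `--shm-size`. Here is a minimal monitoring sketch. It assumes Linux and the default `/dev/shm` mount, and the helper names are hypothetical.

```python
import shutil
import time


def shm_usage(path="/dev/shm"):
    """Return (used_bytes, total_bytes, percent_used) for the mount at path.

    Assumes Linux, where POSIX shared memory is a tmpfs at /dev/shm.
    """
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return usage.used, usage.total, percent


def watch_shm(path="/dev/shm", interval_s=1.0, samples=10):
    """Print usage every interval_s seconds while the workload runs elsewhere."""
    for _ in range(samples):
        used, total, percent = shm_usage(path)
        print(f"{path}: {used / 2**20:.1f} / {total / 2**20:.1f} MiB ({percent:.1f}%)")
        time.sleep(interval_s)
```

Run `watch_shm()` in a second shell inside the same container while the test runs. If usage climbs toward the total, the shared memory size is a likely bottleneck.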
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13418/11/pipeline