Conversation

KickItLikeShika
Contributor

Improving the FractionalBias implementation to be compatible with DDP and adding the necessary tests.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)
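
For context, the usual pattern for making an ignite metric DDP-compatible looks roughly like the sketch below. This is illustrative only, not the exact diff of this PR; the eps guard value and attribute names are assumptions. Accumulator state is created in reset under @reinit__is_reduced, and @sync_all_reduce sums it across processes before compute reads it.

import torch

from ignite.contrib.metrics.regression._base import _BaseRegression
from ignite.exceptions import NotComputableError
from ignite.metrics.metric import reinit__is_reduced, sync_all_reduce


class FractionalBias(_BaseRegression):
    # Accumulators live on self._device so they can be all-reduced.
    @reinit__is_reduced
    def reset(self):
        self._sum_of_errors = torch.tensor(0.0, device=self._device)
        self._num_examples = 0

    def _update(self, output):
        y_pred, y = output
        # The tiny eps (value assumed here) guards against division by zero
        # when y + y_pred == 0.
        errors = 2.0 * (y.view_as(y_pred) - y_pred) / (y_pred + y.view_as(y_pred) + 1e-30)
        self._sum_of_errors += torch.sum(errors).to(self._device)
        self._num_examples += y.shape[0]

    # sync_all_reduce sums the listed attributes across all processes before
    # compute() runs, so every DDP worker returns the same value.
    @sync_all_reduce("_sum_of_errors", "_num_examples")
    def compute(self):
        if self._num_examples == 0:
            raise NotComputableError("FractionalBias must have at least one example before it can be computed.")
        return self._sum_of_errors.item() / self._num_examples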

@github-actions bot added the module: contrib label Mar 16, 2021
@KickItLikeShika
Contributor Author

Regarding the TPU test failure (https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2122146650): I guess we have the same issue as in #1699; I will try to fix it the same way.

@vfdev-5
Collaborator

vfdev-5 commented Mar 16, 2021

@KickItLikeShika thanks! Yes, I think it's a division by zero somewhere...
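
A hedged illustration of that suspicion (constructed for this note, not taken from the PR): with a fractional-bias-style update 2 * (y - y_pred) / (y + y_pred), a zero denominator produces nan, and the eps guard that avoids it leaves results that are dtype-sensitive, so a float32 backend such as XLA can disagree with a float64 reference beyond a very tight tolerance.

import torch

y = torch.tensor([0.0, 2.0])
y_pred = torch.tensor([0.0, 1.0])

# Unguarded: the first element is 0/0, which yields nan.
print(2.0 * (y - y_pred) / (y + y_pred))        # tensor([nan, 0.6667])

# With a tiny eps guard (value assumed) the nan disappears, but float32
# accumulation still differs slightly from a float64 reference.
eps = 1e-30
print(2.0 * (y - y_pred) / (y + y_pred + eps))  # tensor([0.0000, 0.6667])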

@KickItLikeShika
Contributor Author

@vfdev-5 can you please check this? https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2122812029

@vfdev-5
Collaborator

vfdev-5 commented Mar 16, 2021

Yes, this can happen. Let's reduce the tolerance to 1e-5 for xla.

@KickItLikeShika
Contributor Author

KickItLikeShika commented Mar 16, 2021

@vfdev-5 You mean the tolerance should be smaller than 1e-15, right? For example, 1e-30?

@vfdev-5
Collaborator

vfdev-5 commented Mar 16, 2021

I mean the pytest.approx tolerance, like here:

_test_distrib_integration(device, tol=1e-4)
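
For readers following along, a hedged sketch of how that tol argument typically reaches the assertion (the helper body here is illustrative, not ignite's actual test code):

import pytest

def _test_distrib_integration(device, tol=1e-6):
    true_value = 0.123456    # reference value, e.g. computed in float64
    computed = 0.12345605    # value produced by the metric on `device`
    # tol is the absolute tolerance of the pytest.approx comparison.
    assert computed == pytest.approx(true_value, abs=tol)

_test_distrib_integration("xla", tol=1e-5)  # looser tolerance for float32 XLA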

@KickItLikeShika
Contributor Author

@vfdev-5 I have tried adjusting the tolerance and the eps, but it didn't work; I'm still getting the same error.

@vfdev-5
Collaborator

vfdev-5 commented Mar 16, 2021

@KickItLikeShika let's try to repro the issue locally. Please, follow these installation steps:

- name: Install Torch XLA and others
  run: |
    ## Install openblas, mkl, gsutil
    sudo apt-get install -y libopenblas-dev libomp5
    pip install mkl requests gsutil
    ## Install torch & xla
    curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
    python pytorch-xla-env-setup.py --version "nightly"
    ## Install test deps and Ignite
    pip install -r requirements-dev.txt
    python setup.py install

and run the test:

- name: Run Tests
  run: |
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hostedtoolcache/Python/3.6.10/x64/lib
    ## Map the local CPU as an XLA device so the TPU tests can run without real TPU hardware
    export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
    export XRT_WORKERS="localservice:0;grpc://localhost:40934"
    python -c "import torch_xla; print('torch xla version:', torch_xla.__version__)"
    bash tests/run_tpu_tests.sh

or even

pytest tests/ignite/contrib/metrics/regression/test_fractional_bias.py::test_distrib_single_device_xla -vvv -m tpu

@KickItLikeShika
Contributor Author

KickItLikeShika commented Mar 17, 2021

@vfdev-5 it works now, thanks!

@sdesrozis
Contributor

sdesrozis left a comment

@KickItLikeShika Thank you for this work! LGTM

@sdesrozis
Contributor

Horovod is KO (failing) 😔

@KickItLikeShika
Contributor Author

KickItLikeShika commented Mar 17, 2021

@sdesrozis Yes, I see! But I think it's not related, right? The tests for FractionalBias have passed. Please check this: https://github.com/pytorch/ignite/pull/1805/checks?check_run_id=2134663201#step:8:1313

@KickItLikeShika
Contributor Author

@sdesrozis please restart the CI so we can see what happens.

@sdesrozis
Contributor

Let's look into this failure tomorrow, @vfdev-5.

@KickItLikeShika
Contributor Author

@sdesrozis it works fine, thanks!

@vfdev-5 vfdev-5 merged commit b3f7020 into pytorch:master Mar 17, 2021
@vfdev-5
Collaborator

vfdev-5 commented Mar 17, 2021

@KickItLikeShika we probably have to adjust the tolerance for the GPU tests as well: https://app.circleci.com/pipelines/github/pytorch/ignite/1607/workflows/381ab87c-6226-466e-ab64-4bbe8a835f56/jobs/4649

Please send a follow-up PR with tol ~1e-5.
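
That is, loosening the tolerance passed to the distributed test helper for the GPU configuration, along the lines of (exact call site assumed):

_test_distrib_integration(device, tol=1e-5)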

@KickItLikeShika KickItLikeShika deleted the improve-fractional-bias branch March 18, 2021 15:17