$ python -m pytest -n 3 --dist=loadfile -s -v ./tests/test_optimization.py
=============================== test session starts ===============================
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- .../python
cachedir: .pytest_cache
rootdir: .../transformers_manuel, configfile: setup.cfg
plugins: xdist-2.4.0, dash-2.0.0, forked-1.3.0, timeout-2.0.1
[gw0] linux Python 3.9.7 cwd: .../transformers_manuel
[gw1] linux Python 3.9.7 cwd: .../transformers_manuel
[gw2] linux Python 3.9.7 cwd: .../transformers_manuel
[gw0] Python 3.9.7 (default, Sep 16 2021, 13:09:58) -- [GCC 7.5.0]
[gw2] Python 3.9.7 (default, Sep 16 2021, 13:09:58) -- [GCC 7.5.0]
[gw1] Python 3.9.7 (default, Sep 16 2021, 13:09:58) -- [GCC 7.5.0]
gw0 [5] / gw1 [5] / gw2 [5]
scheduling tests via LoadFileScheduling

tests/test_optimization.py::OptimizationTest::test_adafactor
[gw0] PASSED tests/test_optimization.py::OptimizationTest::test_adafactor
tests/test_optimization.py::OptimizationTest::test_adam_w
[gw0] PASSED tests/test_optimization.py::OptimizationTest::test_adam_w
tests/test_optimization.py::OptimizationTest::test_compare_adamw_no_weight_decay
[gw0] FAILED tests/test_optimization.py::OptimizationTest::test_compare_adamw_no_weight_decay
tests/test_optimization.py::OptimizationTest::test_compare_adamw_with_weight_decay
[gw0] FAILED tests/test_optimization.py::OptimizationTest::test_compare_adamw_with_weight_decay
tests/test_optimization.py::ScheduleInitTest::test_schedulers
[gw0] PASSED tests/test_optimization.py::ScheduleInitTest::test_schedulers

==================================== FAILURES =====================================
_______________ OptimizationTest.test_compare_adamw_no_weight_decay _______________
[gw0] linux -- Python 3.9.7 .../python

self =

    def test_compare_adamw_no_weight_decay(self):
>       self.util_adamw_comparison(weight_decay=0)

tests/test_optimization.py:120:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = , weight_decay = 0

    def util_adamw_comparison(self, weight_decay):
        import torch
        import numpy as np

        model_size =1024
        lr = 0.1
        betas=(0.9, 0.999)
        eps = 1e-01

        rng_state = torch.get_rng_state()
        device = "cpu"
        torch.manual_seed(56)
        param_torch = torch.nn.Parameter(torch.randn(model_size, device=device))
        torch.set_rng_state(rng_state)
        torch.manual_seed(56)
        param_transf = torch.nn.Parameter(torch.randn(model_size, device=device))

        optimizer_torch = torch.optim.AdamW([param_torch], lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        optimizer_transf = AdamW(params=[param_transf], lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=True)

        for i in range(100):
            rng_state = torch.get_rng_state()
            param_torch.grad = torch.randn(model_size, device=device)
            torch.set_rng_state(rng_state)
            param_transf.grad = torch.randn(model_size, device=device)
            optimizer_torch.step()
            optimizer_transf.step()

        atol=1e-3
        val_torch = param_torch.detach().numpy()
        val_transf = param_transf.detach().numpy()
>       np.testing.assert_allclose(val_transf, val_torch, err_msg="Mismatch between AdamW implementations!", rtol=0, atol=atol)
E       AssertionError:
E       Not equal to tolerance rtol=0, atol=0.001
E       Mismatch between AdamW implementations!
E       Mismatched elements: 1022 / 1024 (99.8%)
E       Max absolute difference: 1.0594273
E       Max relative difference: 36.50006
E        x: array([-0.841293, -0.970589,  0.211384, ..., -0.593096,  0.516756,
E                2.300092], dtype=float32)
E        y: array([-0.945812, -1.075232,  0.213893, ..., -0.710561,  0.584081,
E                2.978668], dtype=float32)

tests/test_optimization.py:116: AssertionError
______________ OptimizationTest.test_compare_adamw_with_weight_decay ______________
[gw0] linux -- Python 3.9.7 .../python

self =

    def test_compare_adamw_with_weight_decay(self):
>       self.util_adamw_comparison(weight_decay=0.5)

tests/test_optimization.py:123:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = , weight_decay = 0.5

    def util_adamw_comparison(self, weight_decay):
        import torch
        import numpy as np

        model_size =1024
        lr = 0.1
        betas=(0.9, 0.999)
        eps = 1e-01

        rng_state = torch.get_rng_state()
        device = "cpu"
        torch.manual_seed(56)
        param_torch = torch.nn.Parameter(torch.randn(model_size, device=device))
        torch.set_rng_state(rng_state)
        torch.manual_seed(56)
        param_transf = torch.nn.Parameter(torch.randn(model_size, device=device))

        optimizer_torch = torch.optim.AdamW([param_torch], lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        optimizer_transf = AdamW(params=[param_transf], lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, correct_bias=True)

        for i in range(100):
            rng_state = torch.get_rng_state()
            param_torch.grad = torch.randn(model_size, device=device)
            torch.set_rng_state(rng_state)
            param_transf.grad = torch.randn(model_size, device=device)
            optimizer_torch.step()
            optimizer_transf.step()

        atol=1e-3
        val_torch = param_torch.detach().numpy()
        val_transf = param_transf.detach().numpy()
>       np.testing.assert_allclose(val_transf, val_torch, err_msg="Mismatch between AdamW implementations!", rtol=0, atol=atol)
E       AssertionError:
E       Not equal to tolerance rtol=0, atol=0.001
E       Mismatch between AdamW implementations!
E       Mismatched elements: 1004 / 1024 (98%)
E       Max absolute difference: 0.23363012
E       Max relative difference: 10.406861
E        x: array([-0.295148,  0.150579,  0.081928, ...,  0.06077 ,  0.209031,
E                0.247049], dtype=float32)
E        y: array([-0.390061,  0.191684,  0.104576, ...,  0.083256,  0.267295,
E                0.319532], dtype=float32)

tests/test_optimization.py:116: AssertionError
================================= warnings summary ================================
.../lib/python3.9/site-packages/flatbuffers/compat.py:19
.../lib/python3.9/site-packages/flatbuffers/compat.py:19
.../lib/python3.9/site-packages/flatbuffers/compat.py:19
.../lib/python3.9/site-packages/flatbuffers/compat.py:19
  .../lib/python3.9/site-packages/flatbuffers/compat.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

tests/test_optimization.py::ScheduleInitTest::test_schedulers
  .../lib/python3.9/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
    warnings.warn("To get the last learning rate computed by the scheduler, "

tests/test_optimization.py::ScheduleInitTest::test_schedulers
  .../lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================== short test summary info ============================
FAILED tests/test_optimization.py::OptimizationTest::test_compare_adamw_no_weight_decay - AssertionError:
FAILED tests/test_optimization.py::OptimizationTest::test_compare_adamw_with_weight_decay - AssertionError:
==================== 2 failed, 3 passed, 6 warnings in 15.24s =====================
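
For reference, the failing comparison can be pulled out of pytest and run on its own. The sketch below is reconstructed from the traceback above, not copied from the repository's test file; the wrapper name compare_adamw and the top-level "from transformers import AdamW" import are assumptions, while the hyperparameters (seed 56, lr=0.1, betas=(0.9, 0.999), eps=1e-01, 100 steps, atol=1e-3) are taken from the log. Run as-is, it should trip the same assertion as the two FAILED tests.

    # Standalone sketch of the failing AdamW comparison, reconstructed from the
    # traceback above. The function name and import path are assumptions; the
    # hyperparameters mirror the log.
    import numpy as np
    import torch
    from transformers import AdamW  # transformers' AdamW (supports correct_bias)


    def compare_adamw(weight_decay, model_size=1024, steps=100,
                      lr=0.1, betas=(0.9, 0.999), eps=1e-01, atol=1e-3):
        device = "cpu"

        # Identical initial parameters for both optimizers.
        torch.manual_seed(56)
        param_torch = torch.nn.Parameter(torch.randn(model_size, device=device))
        torch.manual_seed(56)
        param_transf = torch.nn.Parameter(torch.randn(model_size, device=device))

        optimizer_torch = torch.optim.AdamW(
            [param_torch], lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        optimizer_transf = AdamW(
            params=[param_transf], lr=lr, betas=betas, eps=eps,
            weight_decay=weight_decay, correct_bias=True)

        for _ in range(steps):
            # Save/restore the RNG state so both parameters see the same gradient.
            rng_state = torch.get_rng_state()
            param_torch.grad = torch.randn(model_size, device=device)
            torch.set_rng_state(rng_state)
            param_transf.grad = torch.randn(model_size, device=device)
            optimizer_torch.step()
            optimizer_transf.step()

        np.testing.assert_allclose(
            param_transf.detach().numpy(), param_torch.detach().numpy(),
            rtol=0, atol=atol, err_msg="Mismatch between AdamW implementations!")


    if __name__ == "__main__":
        compare_adamw(weight_decay=0.0)  # test_compare_adamw_no_weight_decay
        compare_adamw(weight_decay=0.5)  # test_compare_adamw_with_weight_decay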