
(wip)(do not merge) Run all tests on relay #338

Closed
wants to merge 11 commits into from

Conversation

notoraptor
Collaborator

Missing operations in the relay backend needed to make all tests pass:

  • array_setitem
  • conv_transpose2d
  • gather
  • scatter
  • scatter_add

I am currently working on conv_transpose2d. Current issue:

(myia) notoraptor@notoraptor-linux:~/mila/dev/git/myia$ pytest -xvvs tests/frontends/test_pytorch_ops.py::test_torch_conv2d
=========================================================================================== test session starts ============================================================================================
platform linux -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.13.1 -- /home/notoraptor/anaconda3/envs/myia/bin/python
cachedir: .pytest_cache
rootdir: /media/win/Users/notoraptor/mila/dev/git/myia, inifile: pytest.ini
plugins: cov-2.8.1
collected 6 items                                                                                                                                                                                          

tests/frontends/test_pytorch_ops.py::test_torch_conv2d[relay-cpu-grad0] <- tests/multitest.py ANTLR runtime and generated code versions disagree: 4.8!=4.7.2
ANTLR runtime and generated code versions disagree: 4.8!=4.7.2
Cannot find config for target=llvm, workload=('conv2d_NCHWc.x86', ('TENSOR', (2, 6, 4, 5), 'float32'), ('TENSOR', (3, 6, 3, 3), 'float32'), (2, 3), (3, 2, 3, 2), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
FAILED

================================================================================================= FAILURES =================================================================================================
____________________________________________________________________________________ test_torch_conv2d[relay-cpu-grad0] ____________________________________________________________________________________

test = <tests.multitest.MyiaFunctionTest object at 0x7f7570a72650>

    def runtest(test):
>       test.run(fn)

tests/multitest.py:66: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/multitest.py:166: in run
    return self.runtest(self, fn, **self.spec)
tests/frontends/test_pytorch_ops.py:147: in _fwd_and_bwd
    res = gpipeline.run(input=fn, argspec=[*argspec, sens_type])
myia/pipeline/pipeline.py:144: in run
    return self.make()(**args)
myia/pipeline/pipeline.py:201: in __call__
    return self[:](**args)
myia/pipeline/pipeline.py:245: in __call__
    raise results['error']
myia/pipeline/pipeline.py:229: in run_and_catch
    results = fn(**valid_args)
myia/pipeline/steps.py:402: in step_compile
    out = resources.backend.compile(graph, argspec, outspec)
myia/pipeline/resources.py:201: in compile
    return self.backend.compile(graph, argspec, outspec)
myia/compile/backends/__init__.py:277: in compile
    return self.proc.call_method('compile', graph, argspec, outspec)
myia/compile/channel/__init__.py:131: in call_method
    return self._read_msg()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <myia.compile.channel.RPCProcess object at 0x7f756e1fb5d0>

    def _read_msg(self):
        RemoteHandle.current_channel = self
        try:
            res = self.loader.get_data()
        finally:
            RemoteHandle.current_channel = None
        if isinstance(res, LoadedError):
>           raise res
E           myia.utils.serialize.LoadedError: Traceback (most recent call last):
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/channel/__main__.py", line 38, in _rpc_server
E               res = meth(*args, **kwargs)
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/__init__.py", line 297, in compile
E               return handle(self.real.compile(graph, argspec, outspec))
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/relay.py", line 884, in compile
E               self.exec_kind)
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/relay.py", line 675, in run
E               add_functions(self.module, function_map, self.types)
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/relay_helpers.py", line 275, in add_functions
E               mod[gv] = funcs[gv]
E           
E             File "/home/notoraptor/anaconda3/envs/myia/lib/python3.7/site-packages/tvm/ir/module.py", line 75, in __setitem__
E               return self._add(var, val)
E           
E             File "/home/notoraptor/anaconda3/envs/myia/lib/python3.7/site-packages/tvm/ir/module.py", line 84, in _add
E               _ffi_api.Module_Add(self, var, val, update)
E           
E             File "tvm/_ffi/_cython/./packed_func.pxi", line 308, in tvm._ffi._cy3.core.PackedFuncBase.__call__
E           
E             File "tvm/_ffi/_cython/./packed_func.pxi", line 253, in tvm._ffi._cy3.core.FuncCall
E           
E             File "tvm/_ffi/_cython/./base.pxi", line 159, in tvm._ffi._cy3.core.CALL
E           
E           tvm._ffi.base.TVMError: Traceback (most recent call last):
E             [bt] (8) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(TVMFuncCall+0x65) [0x7f8ea4b23a25]
E             [bt] (7) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x539dc4) [0x7f8ea440fdc4]
E             [bt] (6) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x539a24) [0x7f8ea440fa24]
E             [bt] (5) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(tvm::IRModuleNode::Add(tvm::GlobalVar const&, tvm::BaseFunc const&, bool)+0x425) [0x7f8ea440d875]
E             [bt] (4) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x5322d4) [0x7f8ea44082d4]
E             [bt] (3) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(tvm::relay::InferType(tvm::relay::Function const&, tvm::IRModule const&, tvm::GlobalVar const&)+0x1db) [0x7f8ea49c545b]
E             [bt] (2) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0xaeec28) [0x7f8ea49c4c28]
E             [bt] (1) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x5245fa) [0x7f8ea43fa5fa]
E             [bt] (0) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x423d01) [0x7f8ea42f9d01]
E             File "/home/user/conda-bld/tvm-libs_1584032126820/work/src/ir/error.cc", line 133
E           TVMError: 
E           Error(s) have occurred. The program has been annotated with them:
E           
E           In `main`: 
E           v0.0.4
E           fn (%v_parameter25: Tensor[(2, 6, 4, 5), float32], %v_parameter26: Tensor[(3, 6, 3, 3), float32], %v_parameter27: Tensor[(3), float32], %v_parameter28: float32) -> (Tensor[(2, 6, 4, 5), float32], Tensor[(3, 6, 3, 3), float32], Tensor[(3), float32]) {
E             let %seq.0 = (meta[relay.Constant][0],);
E             let %seq.1 = (meta[relay.Constant][1], meta[relay.Constant][2], meta[relay.Constant][3], meta[relay.Constant][4]);
E             let %seq.2 = (meta[relay.Constant][5], meta[relay.Constant][6], meta[relay.Constant][7], meta[relay.Constant][8]);
E             let %seq.3 = broadcast_to(%v_parameter28, meta[relay.attrs.InitOpAttrs][0]);
E             let %seq.4 = sum(%seq.3, axis=[0, 2, 3], keepdims=True);
E             let %seq.5 = reshape(%seq.4, newshape=[3]);
E             let %seq.6 = meta[relay.Constant][9];
E             let %seq.7 = (meta[relay.Constant][10], meta[relay.Constant][11]);
E             let %seq.8 = (meta[relay.Constant][12], meta[relay.Constant][13]);
E             let %seq.9 = (meta[relay.Constant][14], meta[relay.Constant][15]);
E             let %seq.10 = (meta[relay.Constant][16], meta[relay.Constant][17], meta[relay.Constant][18], meta[relay.Constant][19]);
E             %0 = reshape(%v_parameter25, newshape=[1, -1, 0, 0]);
E             %1 = tile(%seq.3, reps=[1, 6, 1, 1]);
E             %2 = reshape(%1, newshape=[-1, 1, 0, 0]);
E             %3 = nn.conv2d(%0, %2, padding=[3, 2, 3, 2], dilation=[2, 3], groups=12);
E             %4 = reshape(%3, newshape=[2, 6, 3, 4, 3]);
E             %5 = sum(%4, axis=[0]);
E             %6 = transpose(%5, axes=[1, 0, 2, 3]);
E             let %seq.11 = strided_slice(%6, begin=[0, 0, 0, 0], end=[None, None, 3, 3]);
E             let %seq.12 = (1, 0);
E             let %seq.13 = ();
E             let %seq.14 = nn.conv2d_transpose(%seq.3, %v_parameter26, channels=3, kernel_size=[3, 3], strides=[2, 3], output_padding=[1, 0], padding=[3, 2, 3, 2]) in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(3, 3, 3, 3), float32]` and `Tensor[(3, 6, 3, 3), float32]`; in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(2, 3, 4, 5), float32]` and `Tensor[(2, 6, 4, 5), float32]`; ;
E             let %seq.15 = (%seq.14, %seq.11, %seq.5);
E             %seq.15
E           }
E           // meta data omitted. you can use show_meta_data=True to include meta data

myia/compile/channel/__init__.py:148: LoadedError
========================================================================================= short test summary info ==========================================================================================
FAILED tests/frontends/test_pytorch_ops.py::test_torch_conv2d[relay-cpu-grad0] - myia.utils.serialize.LoadedError: Traceback (most recent call last):
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================================================ 1 failed in 17.73s ============================================================================================

The main error is in this line: `let %seq.14 = nn.conv2d_transpose(%seq.3, %v_parameter26, channels=3, kernel_size=[3, 3], strides=[2, 3], output_padding=[1, 0], padding=[3, 2, 3, 2])`, which fails with: "in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(3, 3, 3, 3), float32]` and `Tensor[(3, 6, 3, 3), float32]`; in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(2, 3, 4, 5), float32]` and `Tensor[(2, 6, 4, 5), float32]`". I still don't know what causes this.
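For reference, PyTorch defines conv_transpose2d as the gradient of conv2d with respect to its input, and with the exact shapes from the failing test, output_padding=(1, 0) is what recovers the (2, 6, 4, 5) input shape. A small sketch to double-check the expected shapes and values (plain PyTorch, not the myia/relay code path):

```python
import torch
import torch.nn.functional as F

# shapes taken from the failing test: data (2, 6, 4, 5), weight (3, 6, 3, 3),
# stride (2, 3), padding (3, 2)
x = torch.randn(2, 6, 4, 5, requires_grad=True)
w = torch.randn(3, 6, 3, 3)
y = F.conv2d(x, w, stride=(2, 3), padding=(3, 2))
print(tuple(y.shape))  # (2, 3, 4, 3)

# gradient of conv2d w.r.t. its input, computed by autograd...
g = torch.randn_like(y)
(gx,) = torch.autograd.grad(y, x, g)

# ...matches conv_transpose2d with output_padding chosen to recover the input shape
gx2 = F.conv_transpose2d(g, w, stride=(2, 3), padding=(3, 2), output_padding=(1, 0))
assert gx2.shape == x.shape
assert torch.allclose(gx, gx2, atol=1e-4)
```

Note that in this call the weight keeps its conv2d shape (3, 6, 3, 3), interpreted by conv_transpose2d as (in_channels=3, out_channels=6, kH, kW); the relay type error above looks like the weight layout is being interpreted differently there.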

@abergeron
Contributor

I'll have to take a look at the failure. I'll do that after I review the live infer PR.

@abergeron
Contributor

My current impression is that there is a problem in the relay implementation of conv_transpose2d (maybe only in certain cases).

I've made a patch to remove the bias argument since it's bad design. As for the root cause, I'm not 100% sure yet, and I may not have more time to look at it for a while since we are shifting to other projects for the time being.

@notoraptor
Collaborator Author

Hi! I am still trying to fix conv_transpose2d in the relay backend. After a number of tests, it seems that the problems come from the output_padding parameter when it is non-zero.

I tested conv_transpose2d with the same parameters while varying only output_padding over (0, 0), (1, 0), (0, 1) and (1, 1) (see pytest -v tests/frontends/test_pytorch_ops.py::test_conv_transpose2d in this branch: https://github.com/notoraptor/myia/blob/8dc356fbf53042eb8d91130bcbe2ff60030fb057/tests/frontends/test_pytorch_ops.py#L450 ), and it appears that:

  • with output_padding == (1, 0), extra zero values appear along h_out. For example, if (h_out, w_out) == (5, 5), then the 5th row is filled with zeros, which does not match the PyTorch output
  • with output_padding == (0, 1), extra zero values appear along w_out. For example, there is a zero at the end of each row, which is also unexpected
  • with output_padding == (1, 1), extra zero values appear along both h_out and w_out, so that the last row and the last column are filled with unexpected zeros
  • with output_padding == (0, 0), everything is computed correctly

So I deduced that the trailing rows and/or columns of the computed input gradient are filled with zeros when output padding is non-zero. After checking the TVM code, it seems that this zero-filling occurs here:

So I suspect that output padding is still not handled correctly in the TVM backend, and I am now working on fixing it directly in the TVM code. To do that, I am studying the Theano implementation of the conv gradient with respect to the input, which is here:

From what I understand so far, output padding should be used to infer the real input shape, because conv2d can produce the same output shape for different input shapes when the stride is not (1, 1). TVM currently applies output padding only after the internal conv2d has run. Theano, instead, receives input_shape (rather than output padding) as a parameter, and appears to handle it before running the internal conv2d.
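This ambiguity is easy to check directly in PyTorch (a small illustration, not tied to the myia code): with stride 2 and a 3x3 kernel, inputs of height 7 and 8 both produce an output of height 3, and output_padding is what selects which input height conv_transpose2d reconstructs:

```python
import torch
import torch.nn.functional as F

w = torch.randn(1, 1, 3, 3)
for h in (7, 8):
    x = torch.randn(1, 1, h, h)
    # both input heights give output height 3: floor((h - 3) / 2) + 1 == 3
    print(h, "->", F.conv2d(x, w, stride=2).shape[2])

# conv_transpose2d needs output_padding to pick the original input height back
g = torch.randn(1, 1, 3, 3)
assert F.conv_transpose2d(g, w, stride=2).shape[2] == 7
assert F.conv_transpose2d(g, w, stride=2, output_padding=1).shape[2] == 8
```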

I still need further reading to fully understand everything, but I hope to have it fixed within the week.

@abergeron @fosterrath-mila @breuleux

@abergeron
Contributor

See: apache/tvm#4318 for why the fix was reverted apparently.

Generalize pytorch backend tests to all backends.
Pass kwargs to eqtest through run.
Update eqtest for torch.Tensor, taking atol and rtol.
Replace bias with None and back to run_no_relay in test_torch_conv_transpose2d.
Add testing in both backends for:
- test_conv2d
- test_conv2d__non_tuple_args
- test_conv2d__group3
- test_conv_transpose2d
Apply flake8, pydocstyle myia, isort and black.
@notoraptor
Collaborator Author

Hi! Still working on conv2d_transpose.

I did not understand how to fix the bug discussed in @abergeron's old TVM pull request about conv2d_transpose ( here: apache/tvm#4318 (comment) ), so I decided to simply reimplement conv2d_transpose based on the Theano implementation, so that:

  • I could understand how conv2d_transpose is computed
  • the Theano implementation works with all values of groups, dilation, strides, etc., and thus provides full support, while the current TVM implementation is limited to groups == 1 and dilation == (1, 1)

As conv2d_transpose is ultimately a wrapper around conv2d, I started by implementing it as a pure relay graph. If necessary, I could later translate it directly into TVM code.
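As a sanity check of that "wrapper around conv2d" view, here is a minimal PyTorch sketch of the same idea: dilate the input, flip the kernel, then run an ordinary conv2d. The helper is hypothetical (not myia code), supports scalar stride/padding only, groups == 1, dilation == 1, and assumes padding <= kernel_size - 1:

```python
import torch
import torch.nn.functional as F

def conv_transpose2d_via_conv2d(x, weight, stride=1, padding=0, output_padding=0):
    """Hypothetical sketch: transposed conv as dilate + flip + plain conv2d.

    weight uses the F.conv_transpose2d layout (in_channels, out_channels, kh, kw);
    scalar stride/padding only, groups == 1, dilation == 1, padding <= kernel - 1.
    """
    n, cin, hin, win = x.shape
    kh, kw = weight.shape[2], weight.shape[3]
    s, p, op = stride, padding, output_padding
    # 1. dilate the input: insert (s - 1) zeros between elements, plus
    #    output_padding extra zero rows/columns at the bottom/right
    xd = x.new_zeros(n, cin, (hin - 1) * s + 1 + op, (win - 1) * s + 1 + op)
    xd[:, :, ::s, ::s] = x
    # 2. flip the kernel spatially and swap its channel axes to the conv2d layout
    wf = weight.flip(2, 3).transpose(0, 1)
    # 3. "full" correlation: pad by (kernel - 1 - padding), stride 1
    return F.conv2d(xd, wf, padding=(kh - 1 - p, kw - 1 - p))

x = torch.randn(1, 2, 4, 5)
w = torch.randn(2, 3, 3, 3)
ours = conv_transpose2d_via_conv2d(x, w, stride=2, padding=1, output_padding=1)
ref = F.conv_transpose2d(x, w, stride=2, padding=1, output_padding=1)
assert ours.shape == ref.shape
assert torch.allclose(ours, ref, atol=1e-4)
```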

My current implementation is here: https://github.com/notoraptor/myia/blob/full-test-relay/tmp/shape_conv.py#L387

It is based on the Theano implementation, which I fully extracted and cleaned up here (it works with numpy arrays, no Theano symbols): https://github.com/notoraptor/myia/blob/full-test-relay/tmp/conv2d_transpose.py

I needed to add a relay operation dilate, because the input tensor must be dilated before being passed to conv2d. My TVM branch is here: apache/tvm@master...notoraptor:relay-op-dilate
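The dilate I mean is the usual zero-insertion between elements (TVM already has this at the compute/topi level; the relay-op exposure is what the branch adds). An illustrative numpy sketch of the intended semantics:

```python
import numpy as np

def dilate(a, strides):
    # insert (stride - 1) zeros between consecutive elements along each axis
    # (illustrative numpy sketch of the intended `dilate` semantics)
    out = np.zeros(tuple((d - 1) * s + 1 for d, s in zip(a.shape, strides)),
                   dtype=a.dtype)
    out[tuple(slice(None, None, s) for s in strides)] = a
    return out

d = dilate(np.arange(1, 5, dtype=np.float32).reshape(2, 2), (2, 3))
print(d.shape)  # (3, 4)
# [[1. 0. 0. 2.]
#  [0. 0. 0. 0.]
#  [3. 0. 0. 4.]]
```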

The script https://github.com/notoraptor/myia/blob/full-test-relay/tmp/shape_conv.py can also be used to test the Theano implementation (with the --theano option) and the relay graph implementation (with --relay-graph). It uses PyTorch as the reference implementation to compare values.

However, I am currently facing some hard bugs: either a segmentation fault, or the script hanging indefinitely (I even have to kill the process manually, since Ctrl+C does not work). Sometimes all the relay tests in shape_conv.py pass, and sometimes I hit these bugs, more or less at random. I don't know yet whether it is a memory leak or an access violation.

The Theano implementation itself (in Python) seems to work as well as PyTorch with various dilation, groups and strides values. So, if I can fix these bugs, I think we would have a good implementation of conv2d_transpose.

What do you think?

@abergeron @breuleux @fosterrath-mila

@abergeron
Contributor

Is this PR still needed? If so, please clean it up so it is mergable, otherwise close it.

@notoraptor
Collaborator Author

@abergeron I will close it, since I now know which operations are missing in the relay backend.

@notoraptor notoraptor closed this May 5, 2020