
(wip)(do not merge) Run all tests on relay #338

Closed
wants to merge 11 commits into from

Conversation

notoraptor
Collaborator

Missing operations in the relay backend needed to make all tests pass:

  • array_setitem
  • conv_transpose2d
  • gather
  • scatter
  • scatter_add

I am currently working on conv_transpose2d. Current issue:

(myia) notoraptor@notoraptor-linux:~/mila/dev/git/myia$ pytest -xvvs tests/frontends/test_pytorch_ops.py::test_torch_conv2d
=========================================================================================== test session starts ============================================================================================
platform linux -- Python 3.7.6, pytest-5.4.1, py-1.8.1, pluggy-0.13.1 -- /home/notoraptor/anaconda3/envs/myia/bin/python
cachedir: .pytest_cache
rootdir: /media/win/Users/notoraptor/mila/dev/git/myia, inifile: pytest.ini
plugins: cov-2.8.1
collected 6 items                                                                                                                                                                                          

tests/frontends/test_pytorch_ops.py::test_torch_conv2d[relay-cpu-grad0] <- tests/multitest.py ANTLR runtime and generated code versions disagree: 4.8!=4.7.2
ANTLR runtime and generated code versions disagree: 4.8!=4.7.2
Cannot find config for target=llvm, workload=('conv2d_NCHWc.x86', ('TENSOR', (2, 6, 4, 5), 'float32'), ('TENSOR', (3, 6, 3, 3), 'float32'), (2, 3), (3, 2, 3, 2), (1, 1), 'NCHW', 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
FAILED

================================================================================================= FAILURES =================================================================================================
____________________________________________________________________________________ test_torch_conv2d[relay-cpu-grad0] ____________________________________________________________________________________

test = <tests.multitest.MyiaFunctionTest object at 0x7f7570a72650>

    def runtest(test):
>       test.run(fn)

tests/multitest.py:66: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/multitest.py:166: in run
    return self.runtest(self, fn, **self.spec)
tests/frontends/test_pytorch_ops.py:147: in _fwd_and_bwd
    res = gpipeline.run(input=fn, argspec=[*argspec, sens_type])
myia/pipeline/pipeline.py:144: in run
    return self.make()(**args)
myia/pipeline/pipeline.py:201: in __call__
    return self[:](**args)
myia/pipeline/pipeline.py:245: in __call__
    raise results['error']
myia/pipeline/pipeline.py:229: in run_and_catch
    results = fn(**valid_args)
myia/pipeline/steps.py:402: in step_compile
    out = resources.backend.compile(graph, argspec, outspec)
myia/pipeline/resources.py:201: in compile
    return self.backend.compile(graph, argspec, outspec)
myia/compile/backends/__init__.py:277: in compile
    return self.proc.call_method('compile', graph, argspec, outspec)
myia/compile/channel/__init__.py:131: in call_method
    return self._read_msg()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <myia.compile.channel.RPCProcess object at 0x7f756e1fb5d0>

    def _read_msg(self):
        RemoteHandle.current_channel = self
        try:
            res = self.loader.get_data()
        finally:
            RemoteHandle.current_channel = None
        if isinstance(res, LoadedError):
>           raise res
E           myia.utils.serialize.LoadedError: Traceback (most recent call last):
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/channel/__main__.py", line 38, in _rpc_server
E               res = meth(*args, **kwargs)
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/__init__.py", line 297, in compile
E               return handle(self.real.compile(graph, argspec, outspec))
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/relay.py", line 884, in compile
E               self.exec_kind)
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/relay.py", line 675, in run
E               add_functions(self.module, function_map, self.types)
E           
E             File "/media/win/Users/notoraptor/mila/dev/git/myia/myia/compile/backends/relay_helpers.py", line 275, in add_functions
E               mod[gv] = funcs[gv]
E           
E             File "/home/notoraptor/anaconda3/envs/myia/lib/python3.7/site-packages/tvm/ir/module.py", line 75, in __setitem__
E               return self._add(var, val)
E           
E             File "/home/notoraptor/anaconda3/envs/myia/lib/python3.7/site-packages/tvm/ir/module.py", line 84, in _add
E               _ffi_api.Module_Add(self, var, val, update)
E           
E             File "tvm/_ffi/_cython/./packed_func.pxi", line 308, in tvm._ffi._cy3.core.PackedFuncBase.__call__
E           
E             File "tvm/_ffi/_cython/./packed_func.pxi", line 253, in tvm._ffi._cy3.core.FuncCall
E           
E             File "tvm/_ffi/_cython/./base.pxi", line 159, in tvm._ffi._cy3.core.CALL
E           
E           tvm._ffi.base.TVMError: Traceback (most recent call last):
E             [bt] (8) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(TVMFuncCall+0x65) [0x7f8ea4b23a25]
E             [bt] (7) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x539dc4) [0x7f8ea440fdc4]
E             [bt] (6) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x539a24) [0x7f8ea440fa24]
E             [bt] (5) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(tvm::IRModuleNode::Add(tvm::GlobalVar const&, tvm::BaseFunc const&, bool)+0x425) [0x7f8ea440d875]
E             [bt] (4) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x5322d4) [0x7f8ea44082d4]
E             [bt] (3) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(tvm::relay::InferType(tvm::relay::Function const&, tvm::IRModule const&, tvm::GlobalVar const&)+0x1db) [0x7f8ea49c545b]
E             [bt] (2) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0xaeec28) [0x7f8ea49c4c28]
E             [bt] (1) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x5245fa) [0x7f8ea43fa5fa]
E             [bt] (0) /home/notoraptor/anaconda3/envs/myia/lib/libtvm.so(+0x423d01) [0x7f8ea42f9d01]
E             File "/home/user/conda-bld/tvm-libs_1584032126820/work/src/ir/error.cc", line 133
E           TVMError: 
E           Error(s) have occurred. The program has been annotated with them:
E           
E           In `main`: 
E           v0.0.4
E           fn (%v_parameter25: Tensor[(2, 6, 4, 5), float32], %v_parameter26: Tensor[(3, 6, 3, 3), float32], %v_parameter27: Tensor[(3), float32], %v_parameter28: float32) -> (Tensor[(2, 6, 4, 5), float32], Tensor[(3, 6, 3, 3), float32], Tensor[(3), float32]) {
E             let %seq.0 = (meta[relay.Constant][0],);
E             let %seq.1 = (meta[relay.Constant][1], meta[relay.Constant][2], meta[relay.Constant][3], meta[relay.Constant][4]);
E             let %seq.2 = (meta[relay.Constant][5], meta[relay.Constant][6], meta[relay.Constant][7], meta[relay.Constant][8]);
E             let %seq.3 = broadcast_to(%v_parameter28, meta[relay.attrs.InitOpAttrs][0]);
E             let %seq.4 = sum(%seq.3, axis=[0, 2, 3], keepdims=True);
E             let %seq.5 = reshape(%seq.4, newshape=[3]);
E             let %seq.6 = meta[relay.Constant][9];
E             let %seq.7 = (meta[relay.Constant][10], meta[relay.Constant][11]);
E             let %seq.8 = (meta[relay.Constant][12], meta[relay.Constant][13]);
E             let %seq.9 = (meta[relay.Constant][14], meta[relay.Constant][15]);
E             let %seq.10 = (meta[relay.Constant][16], meta[relay.Constant][17], meta[relay.Constant][18], meta[relay.Constant][19]);
E             %0 = reshape(%v_parameter25, newshape=[1, -1, 0, 0]);
E             %1 = tile(%seq.3, reps=[1, 6, 1, 1]);
E             %2 = reshape(%1, newshape=[-1, 1, 0, 0]);
E             %3 = nn.conv2d(%0, %2, padding=[3, 2, 3, 2], dilation=[2, 3], groups=12);
E             %4 = reshape(%3, newshape=[2, 6, 3, 4, 3]);
E             %5 = sum(%4, axis=[0]);
E             %6 = transpose(%5, axes=[1, 0, 2, 3]);
E             let %seq.11 = strided_slice(%6, begin=[0, 0, 0, 0], end=[None, None, 3, 3]);
E             let %seq.12 = (1, 0);
E             let %seq.13 = ();
E             let %seq.14 = nn.conv2d_transpose(%seq.3, %v_parameter26, channels=3, kernel_size=[3, 3], strides=[2, 3], output_padding=[1, 0], padding=[3, 2, 3, 2]) in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(3, 3, 3, 3), float32]` and `Tensor[(3, 6, 3, 3), float32]`; in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(2, 3, 4, 5), float32]` and `Tensor[(2, 6, 4, 5), float32]`; ;
E             let %seq.15 = (%seq.14, %seq.11, %seq.5);
E             %seq.15
E           }
E           // meta data omitted. you can use show_meta_data=True to include meta data

myia/compile/channel/__init__.py:148: LoadedError
========================================================================================= short test summary info ==========================================================================================
FAILED tests/frontends/test_pytorch_ops.py::test_torch_conv2d[relay-cpu-grad0] - myia.utils.serialize.LoadedError: Traceback (most recent call last):
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================================================ 1 failed in 17.73s ============================================================================================

The main error is in this line: `let %seq.14 = nn.conv2d_transpose(%seq.3, %v_parameter26, channels=3, kernel_size=[3, 3], strides=[2, 3], output_padding=[1, 0], padding=[3, 2, 3, 2])`, which fails with: "in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(3, 3, 3, 3), float32]` and `Tensor[(3, 6, 3, 3), float32]`; in particular dimension 1 conflicts 3 does not match 6; unable to unify: `Tensor[(2, 3, 4, 5), float32]` and `Tensor[(2, 6, 4, 5), float32]`". I still don't know what causes this.
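For reference, PyTorch defines conv_transpose2d as the gradient of conv2d with respect to its input, and with the exact shapes from the failing test, output_padding=(1, 0) is what recovers the (2, 6, 4, 5) input shape. A small sketch to double-check the expected shapes and values (plain PyTorch, not the myia/relay code path):

```python
import torch
import torch.nn.functional as F

# shapes taken from the failing test: data (2, 6, 4, 5), weight (3, 6, 3, 3),
# stride (2, 3), padding (3, 2)
x = torch.randn(2, 6, 4, 5, requires_grad=True)
w = torch.randn(3, 6, 3, 3)
y = F.conv2d(x, w, stride=(2, 3), padding=(3, 2))
print(tuple(y.shape))  # (2, 3, 4, 3)

# gradient of conv2d w.r.t. its input, computed by autograd...
g = torch.randn_like(y)
(gx,) = torch.autograd.grad(y, x, g)

# ...matches conv_transpose2d with output_padding chosen to recover the input shape
gx2 = F.conv_transpose2d(g, w, stride=(2, 3), padding=(3, 2), output_padding=(1, 0))
assert gx2.shape == x.shape
assert torch.allclose(gx, gx2, atol=1e-4)
```

Note that in this call the weight keeps its conv2d shape (3, 6, 3, 3), interpreted by conv_transpose2d as (in_channels=3, out_channels=6, kH, kW); the relay type error above looks like the weight layout is being interpreted differently there.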

@abergeron
Contributor

I'll have to take a look at the failure. I'll do that after I review the live infer PR.

@abergeron
Contributor

My current impression is that there is a problem in the relay implementation of conv_transpose2d (maybe only in certain cases).

I've made a patch to remove the bias argument since it's bad design. As for the root cause, I'm not 100% sure yet, and I may not have more time to look at it for a while since we are shifting to other projects for the time being.

@notoraptor
Collaborator Author

Hi! I am still trying to fix conv_transpose2d in the relay backend. After a number of tests, it seems that the problems come from the output_padding parameter when it is non-zero.

I tested conv_transpose2d with the same parameters while varying only output_padding over (0, 0), (1, 0), (0, 1) and (1, 1) (see pytest -v tests/frontends/test_pytorch_ops.py::test_conv_transpose2d in this branch: https://github.com/notoraptor/myia/blob/8dc356fbf53042eb8d91130bcbe2ff60030fb057/tests/frontends/test_pytorch_ops.py#L450 ), and it appears that:

  • with output_padding == (1, 0), extra zero values appear along h_out. For example, if (h_out, w_out) == (5, 5), then the 5th row is filled with zeros, which does not match the PyTorch output
  • with output_padding == (0, 1), extra zero values appear along w_out. For example, there is a zero at the end of each row, which is also unexpected
  • with output_padding == (1, 1), extra zero values appear along both h_out and w_out, so that the last row and the last column are filled with unexpected zeros
  • with output_padding == (0, 0), everything is computed correctly

So I deduced that the trailing rows and/or columns of the computed input gradient are filled with zeros when output padding is non-zero. After checking the TVM code, it seems that this zero-filling occurs here:

So I suspect that output padding is still not handled correctly in the TVM backend, and I am now working on fixing it directly in the TVM code. To do that, I am studying the Theano implementation of the conv gradient with respect to the input, which is here:

From what I understand so far, output padding should be used to infer the real input shape, because conv2d can produce the same output shape for different input shapes when the stride is not (1, 1). TVM currently applies output padding only after the internal conv2d has run. Theano, instead, receives input_shape (rather than output padding) as a parameter, and appears to handle it before running the internal conv2d.
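This ambiguity is easy to check directly in PyTorch (a small illustration, not tied to the myia code): with stride 2 and a 3x3 kernel, inputs of height 7 and 8 both produce an output of height 3, and output_padding is what selects which input height conv_transpose2d reconstructs:

```python
import torch
import torch.nn.functional as F

w = torch.randn(1, 1, 3, 3)
for h in (7, 8):
    x = torch.randn(1, 1, h, h)
    # both input heights give output height 3: floor((h - 3) / 2) + 1 == 3
    print(h, "->", F.conv2d(x, w, stride=2).shape[2])

# conv_transpose2d needs output_padding to pick the original input height back
g = torch.randn(1, 1, 3, 3)
assert F.conv_transpose2d(g, w, stride=2).shape[2] == 7
assert F.conv_transpose2d(g, w, stride=2, output_padding=1).shape[2] == 8
```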

I still need further reading to fully understand everything, but I hope to have it fixed within the week.

@abergeron @fosterrath-mila @breuleux

@abergeron
Contributor

See: apache/tvm#4318 for why the fix was reverted apparently.

Generalize pytorch backend tests to all backends.
Pass kwargs to eqtest through run.
Update eqtest for torch.Tensor, taking atol and rtol.
Replace bias with None and back to run_no_relay in test_torch_conv_transpose2d.
Add testing in both backends for:
- test_conv2d
- test_conv2d__non_tuple_args
- test_conv2d__group3
- test_conv_transpose2d
Apply flake8, pydocstyle myia, isort and black.
@notoraptor
Collaborator Author

Hi! Still working on conv2d_transpose.

I did not understand how to fix the bug discussed in @abergeron's old TVM pull request about conv2d_transpose ( here: apache/tvm#4318 (comment) ), so I decided to simply reimplement conv2d_transpose based on the Theano implementation, so that:

  • I could understand how conv2d_transpose is computed
  • the Theano implementation works with all values of groups, dilation, strides, etc., and thus provides full support, while the current TVM implementation is limited to groups == 1 and dilation == (1, 1)

As conv2d_transpose is ultimately a wrapper around conv2d, I started by implementing it as a pure relay graph. If necessary, I could later translate it directly into TVM code.
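As a sanity check of that "wrapper around conv2d" view, here is a minimal PyTorch sketch of the same idea: dilate the input, flip the kernel, then run an ordinary conv2d. The helper is hypothetical (not myia code), supports scalar stride/padding only, groups == 1, dilation == 1, and assumes padding <= kernel_size - 1:

```python
import torch
import torch.nn.functional as F

def conv_transpose2d_via_conv2d(x, weight, stride=1, padding=0, output_padding=0):
    """Hypothetical sketch: transposed conv as dilate + flip + plain conv2d.

    weight uses the F.conv_transpose2d layout (in_channels, out_channels, kh, kw);
    scalar stride/padding only, groups == 1, dilation == 1, padding <= kernel - 1.
    """
    n, cin, hin, win = x.shape
    kh, kw = weight.shape[2], weight.shape[3]
    s, p, op = stride, padding, output_padding
    # 1. dilate the input: insert (s - 1) zeros between elements, plus
    #    output_padding extra zero rows/columns at the bottom/right
    xd = x.new_zeros(n, cin, (hin - 1) * s + 1 + op, (win - 1) * s + 1 + op)
    xd[:, :, ::s, ::s] = x
    # 2. flip the kernel spatially and swap its channel axes to the conv2d layout
    wf = weight.flip(2, 3).transpose(0, 1)
    # 3. "full" correlation: pad by (kernel - 1 - padding), stride 1
    return F.conv2d(xd, wf, padding=(kh - 1 - p, kw - 1 - p))

x = torch.randn(1, 2, 4, 5)
w = torch.randn(2, 3, 3, 3)
ours = conv_transpose2d_via_conv2d(x, w, stride=2, padding=1, output_padding=1)
ref = F.conv_transpose2d(x, w, stride=2, padding=1, output_padding=1)
assert ours.shape == ref.shape
assert torch.allclose(ours, ref, atol=1e-4)
```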

My current implementation is here: https://github.com/notoraptor/myia/blob/full-test-relay/tmp/shape_conv.py#L387

It is based on the Theano implementation, which I fully extracted and cleaned up here (it works with numpy arrays, no Theano symbols): https://github.com/notoraptor/myia/blob/full-test-relay/tmp/conv2d_transpose.py

I needed to add a relay operation dilate, because the input tensor must be dilated before being passed to conv2d. My TVM branch is here: apache/tvm@master...notoraptor:relay-op-dilate
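The dilate I mean is the usual zero-insertion between elements (TVM already has this at the compute/topi level; the relay-op exposure is what the branch adds). An illustrative numpy sketch of the intended semantics:

```python
import numpy as np

def dilate(a, strides):
    # insert (stride - 1) zeros between consecutive elements along each axis
    # (illustrative numpy sketch of the intended `dilate` semantics)
    out = np.zeros(tuple((d - 1) * s + 1 for d, s in zip(a.shape, strides)),
                   dtype=a.dtype)
    out[tuple(slice(None, None, s) for s in strides)] = a
    return out

d = dilate(np.arange(1, 5, dtype=np.float32).reshape(2, 2), (2, 3))
print(d.shape)  # (3, 4)
# [[1. 0. 0. 2.]
#  [0. 0. 0. 0.]
#  [3. 0. 0. 4.]]
```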

The script https://github.com/notoraptor/myia/blob/full-test-relay/tmp/shape_conv.py can also be used to test the Theano implementation (with the --theano option) and the relay graph implementation (with --relay-graph). It uses PyTorch as the reference implementation to compare values.

However, I am currently facing some hard bugs: either a segmentation fault, or the script hanging indefinitely (I even have to kill the process manually, since Ctrl+C does not work). Sometimes all the relay tests in shape_conv.py pass, and sometimes I hit these bugs, more or less at random. I don't know yet whether it is a memory leak or an access violation.

The Theano implementation itself (in Python) seems to work as well as PyTorch with various dilation, groups and strides values. So, if I can fix these bugs, I think we would have a good implementation of conv2d_transpose.

What do you think?

@abergeron @breuleux @fosterrath-mila

@abergeron
Contributor

Is this PR still needed? If so, please clean it up so it is mergable, otherwise close it.

@notoraptor
Collaborator Author

@abergeron I will close it, since I now know which operations are missing in the relay backend.

@notoraptor notoraptor closed this May 5, 2020