[Relay] Add a PyTorch to Relay Parser #4497

alexwong · 2019-12-11T01:13:07Z

Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

Note: No need to review yet, just here for visibility

Originally submitted PR in a fork of TVM but putting one here as well. May close other one depending on outcome of discussion here https://discuss.tvm.ai/t/discuss-adding-a-pytorch-frontend/5026/4.

Support PyTorch natively in TVM by providing a relay parser. Like other frontends, grab the relay module and paramaters to build via:
mod, params = relay.frontend.from_pytorch(trace, input_shapes)

Tested against torchvision models in the included test_forward.py. Some discussion here: https://discuss.tvm.ai/t/discuss-adding-a-pytorch-frontend/5026/4

jwfromm

Overall this looks pretty good, I left a few comments / questions that might help clean some parts up.

jwfromm · 2019-12-13T00:00:31Z

tests/python/frontend/pytorch/test_forward.py

+
+# Test Functions
+def test_add1():
+    verify_model('Add1')


I think this would be much cleaner if we brought in the single_op definitions and grouped the tests into umbrella functions. For example, we could have test_add that looks something like:

def test_add(): input_data = torch.rand([10]).float() # test addition of two input tensors class Add(Module): def forward(self, *args): return args[0] + args[0] validate_model(Add(), input_data) # test constant addition class Add(Module): def forward(self, *args): return args[0] + 1 validate_model(Add(), input_data) ...

This structuring would make the style much closer to other frontend tests and allows more customized input data shape. Using 224x224x3 for simple operations isn't necessary and is probably a waste of time.

Just had another look at the other frontend tests and I agree. Will fix.

Currently having some issues getting the CI to pass with this change. Trying some things but the issue is that for the GPU test, we are running our of memory. Doesn't really make sense to me as this shouldn't be a functional change.

I wonder if its because Pytorch is trying to keep around all the models being built. You might be able to create a model, run a test, then use del model to force its cleanup.

tests/python/frontend/pytorch/test_forward.py

jwfromm · 2019-12-13T00:07:37Z

python/tvm/relay/frontend/pytorch.py

+        # Create corresponding shape and add to input
+        for input_name, ir_input in zip(self._input_shapes, ir_inputs[1:]):
+            input_shape = self._input_shapes[input_name]
+            tensor = tvm.nd.array(np.zeros(input_shape).astype(np.float32))


This script assumes that all data is float32 (not just here but in conversion functions as well). Is that needed or can we be more flexible with types? Presumably the torchscript nodes have something like a type attribute that can be used.

This was done for simplicity and because how we're using the parser in a separate project also assumes float32. Should be doable to be more flexible with typing though.

python/tvm/relay/frontend/pytorch.py

alexwong · 2020-01-06T23:58:17Z

Thanks for the review @jwfromm and sorry for the delay! Was busy with some other things and then the holidays happened!

masahi · 2020-01-08T09:13:45Z

@alexwong it seems CI is stuck after failing resnet test?

alexwong · 2020-01-08T19:10:50Z

@alexwong it seems CI is stuck after failing resnet test?

Yes, also some operator unit tests failed (batch_norm and dense). Slightly hard to debug as I can't seem to reproduce locally atm. But I need to write some unit tests for handling different types and this may catch it. Will see what is happening.

zhiics · 2020-01-08T23:12:24Z

@alexwong I see the GPU frontend column in CI executes more than 100mins for the tests (and it still runs). That may cause some problems. For example, it might be terminated as we have a time limit for it. Probably we don't want to test different variations of each network.

python/tvm/relay/frontend/pytorch.py

masahi · 2020-01-14T09:12:41Z

@alexwong I tried your PR locally. With pytorch v1.3 it works, but they introduced big change in #28408 and #28409, and it broke your PR (I may be wrong about which PR broke it). Below is how their IR looks like for resnet18 now. My pytorch version is '1.5.0a0+0dbd5c0' (the output of torch.version).

UPDATE: The breaking change might have come from #25089. Not sure if this commit was part of v1.3 release.

graph(%self.1 : __torch__.torch.nn.modules.module.___torch_mangle_66.Module,
      %input.1 : Float(1, 3, 224, 224)):
  %1452 : __torch__.torch.nn.modules.module.___torch_mangle_65.Module = prim::GetAttr[name="fc"](%self.1)
  %1449 : __torch__.torch.nn.modules.module.___torch_mangle_64.Module = prim::GetAttr[name="avgpool"](%self.1)
  %1448 : __torch__.torch.nn.modules.module.___torch_mangle_63.Module = prim::GetAttr[name="layer4"](%self.1)
  %1402 : __torch__.torch.nn.modules.module.___torch_mangle_47.Module = prim::GetAttr[name="layer3"](%self.1)
  %1356 : __torch__.torch.nn.modules.module.___torch_mangle_31.Module = prim::GetAttr[name="layer2"](%self.1)
  %1310 : __torch__.torch.nn.modules.module.___torch_mangle_15.Module = prim::GetAttr[name="layer1"](%self.1)
  %1273 : __torch__.torch.nn.modules.module.___torch_mangle_2.Module = prim::GetAttr[name="maxpool"](%self.1)
  %1272 : __torch__.torch.nn.modules.module.___torch_mangle_1.Module = prim::GetAttr[name="relu"](%self.1)
  %1271 : __torch__.torch.nn.modules.module.___torch_mangle_0.Module = prim::GetAttr[name="bn1"](%self.1)
  %1265 : __torch__.torch.nn.modules.module.Module = prim::GetAttr[name="conv1"](%self.1)
  %1528 : Tensor = prim::CallMethod[name="forward"](%1265, %input.1)
  %1529 : Tensor = prim::CallMethod[name="forward"](%1271, %1528)
  %1530 : Tensor = prim::CallMethod[name="forward"](%1272, %1529)
  %1531 : Tensor = prim::CallMethod[name="forward"](%1273, %1530)
  %1532 : Tensor = prim::CallMethod[name="forward"](%1310, %1531)
  %1533 : Tensor = prim::CallMethod[name="forward"](%1356, %1532)
  %1534 : Tensor = prim::CallMethod[name="forward"](%1402, %1533)
  %1535 : Tensor = prim::CallMethod[name="forward"](%1448, %1534)
  %1536 : Tensor = prim::CallMethod[name="forward"](%1449, %1535)
  %1182 : int = prim::Constant[value=1]() # /home/masa/anaconda3/lib/python3.7/site-packages/torchvision-0.5.0a0+07cbb46-py3.7-linux-x86_64.egg/torchvision/models/resnet.py:210:0
  %1183 : int = prim::Constant[value=-1]() # /home/masa/anaconda3/lib/python3.7/site-packages/torchvision-0.5.0a0+07cbb46-py3.7-linux-x86_64.egg/torchvision/models/resnet.py:210:0
  %input : Float(1, 512) = aten::flatten(%1536, %1182, %1183) # /home/masa/anaconda3/lib/python3.7/site-packages/torchvision-0.5.0a0+07cbb46-py3.7-linux-x86_64.egg/torchvision/models/resnet.py:210:0
  %1537 : Tensor = prim::CallMethod[name="forward"](%1452, %input)
  return (%1537)

alexwong · 2020-01-14T21:31:28Z

@alexwong I tried your PR locally. With pytorch v1.3 it works, but they introduced big change in #28408 and #28409, and it broke your PR (I may be wrong about which PR broke it). Below is how their IR looks like for resnet18 now. My pytorch version is '1.5.0a0+0dbd5c0' (the output of torch.version).

graph(%self.1 : __torch__.torch.nn.modules.module.___torch_mangle_66.Module,
      %input.1 : Float(1, 3, 224, 224)):
  %1452 : __torch__.torch.nn.modules.module.___torch_mangle_65.Module = prim::GetAttr[name="fc"](%self.1)
  %1449 : __torch__.torch.nn.modules.module.___torch_mangle_64.Module = prim::GetAttr[name="avgpool"](%self.1)
  %1448 : __torch__.torch.nn.modules.module.___torch_mangle_63.Module = prim::GetAttr[name="layer4"](%self.1)
  %1402 : __torch__.torch.nn.modules.module.___torch_mangle_47.Module = prim::GetAttr[name="layer3"](%self.1)
  %1356 : __torch__.torch.nn.modules.module.___torch_mangle_31.Module = prim::GetAttr[name="layer2"](%self.1)
  %1310 : __torch__.torch.nn.modules.module.___torch_mangle_15.Module = prim::GetAttr[name="layer1"](%self.1)
  %1273 : __torch__.torch.nn.modules.module.___torch_mangle_2.Module = prim::GetAttr[name="maxpool"](%self.1)
  %1272 : __torch__.torch.nn.modules.module.___torch_mangle_1.Module = prim::GetAttr[name="relu"](%self.1)
  %1271 : __torch__.torch.nn.modules.module.___torch_mangle_0.Module = prim::GetAttr[name="bn1"](%self.1)
  %1265 : __torch__.torch.nn.modules.module.Module = prim::GetAttr[name="conv1"](%self.1)
  %1528 : Tensor = prim::CallMethod[name="forward"](%1265, %input.1)
  %1529 : Tensor = prim::CallMethod[name="forward"](%1271, %1528)
  %1530 : Tensor = prim::CallMethod[name="forward"](%1272, %1529)
  %1531 : Tensor = prim::CallMethod[name="forward"](%1273, %1530)
  %1532 : Tensor = prim::CallMethod[name="forward"](%1310, %1531)
  %1533 : Tensor = prim::CallMethod[name="forward"](%1356, %1532)
  %1534 : Tensor = prim::CallMethod[name="forward"](%1402, %1533)
  %1535 : Tensor = prim::CallMethod[name="forward"](%1448, %1534)
  %1536 : Tensor = prim::CallMethod[name="forward"](%1449, %1535)
  %1182 : int = prim::Constant[value=1]() # /home/masa/anaconda3/lib/python3.7/site-packages/torchvision-0.5.0a0+07cbb46-py3.7-linux-x86_64.egg/torchvision/models/resnet.py:210:0
  %1183 : int = prim::Constant[value=-1]() # /home/masa/anaconda3/lib/python3.7/site-packages/torchvision-0.5.0a0+07cbb46-py3.7-linux-x86_64.egg/torchvision/models/resnet.py:210:0
  %input : Float(1, 512) = aten::flatten(%1536, %1182, %1183) # /home/masa/anaconda3/lib/python3.7/site-packages/torchvision-0.5.0a0+07cbb46-py3.7-linux-x86_64.egg/torchvision/models/resnet.py:210:0
  %1537 : Tensor = prim::CallMethod[name="forward"](%1452, %input)
  return (%1537)

I can take a look again in the next few days. Will probably move to at least support PT 1.4 which was just released (or sometime in the next few days) and may have those IR changes as well. Some to-do's remaining are making sure different types are working, cleaning up the tests based off @jwfromm comments, and updating the parser to work for 1.4>.

masahi · 2020-01-14T22:27:31Z

I'm happy to merge this even if we can only support v1.3 models for now. I want to send other op converters I need.

masahi · 2020-01-15T06:33:29Z

ok I found a dirty hack to remove prim::CallMethod. Add
torch._C._jit_pass_inline(trace.graph)
after tracing. This will move operators hidden inside each prim::CallMethod into top level.

python/tvm/relay/frontend/pytorch.py

masahi · 2020-01-22T08:21:26Z

@alexwong I've sent a PR #4756 which updates PyTorch on CI to 1.4. You can add torch._C._jit_pass_inline(trace.graph) after tracing.

btw, PyTorch on CI is v1.2. This may be one of the reasons you couldn't repro CI errors locally.

python/tvm/relay/frontend/pytorch.py

masahi · 2020-02-25T02:57:57Z

@icemelon9 a logic and a test case for channel_mulitplier > 1 added.
@FrozenGene please review the change you requested.

icemelon

lgtm

icemelon · 2020-02-25T04:14:57Z

Thanks everyone. This is now merged

* Add a PyTorch to Relay parser * Add alexnet, googlenet, mnasnet, shufflenet wip * Fix lint * Remove fix for shufflenet * Lower check * Pull changes from neo-ai/tvm changes * Remove commented out section * Use infer_shape everywhere * Change back to using trace instead of path in from_pytorch * Parse state_dict to add param names * Umbrella single_op under test_forwards * Remove print and cleanup call * Check if update to test broke CI * Retrigger CI * Add back in updated tests * Try splitting up tests * First pass at flexible typing, implemented for ones * Add int32 for all ops * Remove print statements * Fix lint * Broad except * Add other tensor types * Temporarily use old tests * Retrigger CI * Lower type names * Use numpy to convert in dense op * Fix lint * Remove print * Need to cleanup but verify int32 works for add * Rough tests for different types, a lot of types are not supported on CPU * Probably doesn't build, need to save work as I have to switch branches (constantly) * Parse param type * Remove print stmt in parser * Clean up some code * Working on flaot32 for bn * Add resnet18 double type * Fix lint * Temporarily move PT tests first * Temporarily add back refactored tests to fix mem issue * Add more type test and temp remove some tests * Comment out tests, hopefully CI prints a trace * Get stack trace * Remove operator dict key, rename op_name to node_id, remove dead code * Make relay map a list * Remove some hacky string stuff * Move to PyTorch 1.4 * Remove input_type as param * Remove _get_fill_value, fix full ops * Remove unused code and combine ops for identity and none * Remove fn_param * Clean up main loop * Remove useless if/else for outputs * Remove ir_names, only used once * Remove some string hacking * Remove string parsing to get output name * Fix bug with output sizes of nodes * Use attributeNames in parse ops * Remove continue and add_op in parse_op * Do this everywhere, use assert instead of explciitly type casting * Remove unnecessary swap * Slight refactor for elemwise input parse * Use a copy of graph everywhere * Rename nid_to_node_name * Refactor parse import prereqs * Clean up input node kind check * Clean up conditionals * Clean up add_op * Cleanup type for ones and zeros op * Fix lint * Add torch install to CI * Actually use torch * Try moving import torch to only where it's needed * Import torch for CI * Use take op for select * Temporarily add ignore for jit inline pass for CI * Use CompleteTensorType, might be a PT 1.2 only thing * Use different types in elemwise op * Use float16 ones * Fix float16 test * Remove the temp docker changes * Remove temp test * Temporarily comment out original tests * Remove file * Empty cache after each test * Add some prints and lower input sizes * Try using no grad * Trying to globally set grad off * Use no grad for torchvision * Remove xfail tests * Remove VGG and AlexNet due to some issues * Combine pooling tests * Remove extra test file * Remove single op, remove larger pooling tests * Remove maxpool3 * Remove debug prints * Remove inference call and add no_grad in measure latency * Use standard string start char * Remove redundant infer_shape in slice * Convert most to checks to just expr * Remove extra paren * More refactor of isinstance * Add helper for creating typed constants * Assert instead of return when no matching type * Remove network variants * Add no_grad when forward, remove deatch, fix lint * Change isinstance to expr in transpose * Use opnotimplemented, refactor * Fix full ops, remove duplicate tests * Never use shape field unless we know the type * Remove comma, retrigger CI * Add paren, retrigger CI * Use inline if-else for flags * Throw exception instead of assert * Remove version check for CI * Check version when doing inline pass * Fix lint * Lower more input sizes * Add new line, conv2d only accepts weight as expr * Use tvm.runtime.ndarray * Remove change to torch version install * Try no grad for mobilenet * Fix lint * Fix lint again * Revert to last passing * Delete test files * Ignore lint * Revert back * Comment out mobilenet * Clean up compare compiled and baseline outputs * Use IRModule * Add todos * Refactor use_bias * Add todo for fix conv op channels * Change input to data type * Remove todo * Handle channel multiplier > 1

masahi self-assigned this Dec 13, 2019

jwfromm requested changes Dec 13, 2019

View reviewed changes

alexwong commented Dec 19, 2019

View reviewed changes

python/tvm/relay/frontend/pytorch.py Show resolved Hide resolved

masahi reviewed Jan 9, 2020

View reviewed changes

python/tvm/relay/frontend/pytorch.py Outdated Show resolved Hide resolved