
[TOPI] Improve CUDA softmax scheduling #5600

Merged (1 commit, May 25, 2020)
Conversation

wpan11nv (Contributor)

  • Do not use multiple kernels

  • Schedule with warp reductions

  • Fixed a bug on the lower warp memory pass

  • Fixed warp shuffle intrinsics for the nvptx backend.

Signed-off-by: Wei Pan weip@nvidia.com


@wpan11nv wpan11nv force-pushed the softmax branch 4 times, most recently from 0d8bc74 to eb40467 Compare May 15, 2020 21:30
wpan11nv (Contributor, Author) commented May 16, 2020

Please help review this PR, @tqchen and @roastduck

roastduck (Contributor) left a comment


LGTM

src/target/llvm/codegen_llvm.cc (outdated review thread, resolved)
tqchen (Member) commented May 18, 2020

@icemelon9 @Hzfengsy @Shawn-Inspur please also take a look

@wpan11nv wpan11nv force-pushed the softmax branch 2 times, most recently from 20fd710 to b29f74f Compare May 21, 2020 16:27
tqchen (Member) commented May 23, 2020

Shawn-IEITSystems (Contributor) left a comment


LGTM

wpan11nv (Contributor, Author)

This looks like a doc error. Let me trigger the build again.

- Do not use multiple kernels

- Schedule with warp reductions

- Fixed a bug on the lower warp memory pass

- Fixed warp shuffle intrinsics for the nvptx backend.

Signed-off-by: Wei Pan <weip@nvidia.com>
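The warp-reduction scheduling described in the commit bullets above can be illustrated with a small model (a hedged sketch in plain Python, not TVM code: each list element stands in for one lane's register, and each step mimics what a `__shfl_down_sync`-style shuffle computes in hardware):

```python
def warp_reduce_sum(lanes):
    """Model a warp-level tree reduction: after log2(width) shuffle-down
    steps, lane 0 holds the sum of all lanes."""
    width = len(lanes)          # e.g. 32 for a CUDA warp
    vals = list(lanes)
    offset = width // 2
    while offset > 0:
        # lane i adds the value held by lane i + offset (if it exists)
        vals = [vals[i] + (vals[i + offset] if i + offset < width else 0)
                for i in range(width)]
        offset //= 2
    return vals[0]

print(warp_reduce_sum(range(32)))   # 496, i.e. sum of 0..31
```

A single kernel built on this pattern avoids the extra kernel launches and shared-memory round-trips of the previous multi-kernel softmax schedule.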
wpan11nv (Contributor, Author)

This build is fine now and all comments are addressed. Thanks all for reviewing this patch!

@tqchen tqchen merged commit 7b74a86 into apache:master May 25, 2020
tqchen (Member) commented May 25, 2020

Thanks @wpan11nv, @roastduck, @Shawn-Inspur!

t-vi (Contributor) commented Jun 3, 2020

This broke the ROCm backend.

tqchen (Member) commented Jun 3, 2020

@t-vi can you elaborate?

@wpan11nv please also follow up. One thing that got missed during review is that we should move the nvptx related intrinsics into codegen_nvptx.cc instead.

@@ -39,6 +39,7 @@ def schedule_softmax(outs):
outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
s = te.create_schedule([x.op for x in outs])
softmax = outs[0]
tgt = target_.Target.current(allow_none=False)
tqchen (Member) commented Jun 3, 2020


We might want to register the warp level strategies only when the target is cuda, given that the "gpu" schedule is reused by other GPUs that do not support warp shuffles
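tqchen's suggestion amounts to a guard along these lines (a hypothetical sketch only; the function name and the set of shuffle-capable targets are illustrative assumptions, not the actual TOPI strategy registration API):

```python
def pick_softmax_schedule(target_name):
    """Choose a softmax schedule based on whether the target is assumed
    to support warp shuffle intrinsics (illustrative sketch only)."""
    # Assumption for this sketch: only these targets get warp shuffles.
    warp_shuffle_targets = {"cuda", "nvptx"}
    if target_name in warp_shuffle_targets:
        return "warp_reduction"
    # Other GPU targets fall back to a shared-memory reduction.
    return "shared_memory_reduction"
```

Under this guard, a target like "rocm" (at the time of this discussion) would keep the older shared-memory schedule instead of the warp-based one.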

A contributor replied:

Hm. I'd like to benefit from the GPU schedules with warps...

t-vi (Contributor) commented Jun 3, 2020

So ROCm uses the CUDA schedule, but warp reductions don't seem to currently work (so arguably, ROCm would want to be improved). But before this PR, one could run resnet18 with the rocm backend, and now one cannot.
This might also have been seen earlier, when running the tests on ROCm.
I've looked a bit into fixing it, but I haven't fully understood from which of the three related patches this stems.
(Incidentally, it triggered also a corner case for me on cuda where nvrtc would accidentally use a cuda-8.0 instead of the 10.1 that the libnvrtc belonged to.)

wpan11nv (Contributor, Author) commented Jun 3, 2020

So ROCm uses the CUDA schedule, but warp reductions don't seem to currently work (so arguably, ROCm would want to be improved). But before this PR, one could run resnet18 with the rocm backend, and now one cannot.
This might also have been seen earlier, when running the tests on ROCm.
I've looked a bit into fixing it, but I haven't fully understood from which of the three related patches this stems.
(Incidentally, it triggered also a corner case for me on cuda where nvrtc would accidentally use a cuda-8.0 instead of the 10.1 that the libnvrtc belonged to.)

Can we disable this for ROCm here?
https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/cuda/softmax.py#L60

t-vi (Contributor) commented Jun 3, 2020

I'll just work on a fix.

wpan11nv (Contributor, Author) commented Jun 3, 2020

I'll just work on a fix.

Thanks! Will you fix it or just disable this schedule? Let me know if you need any help to enable it.

t-vi (Contributor) commented Jun 4, 2020

I'm adding shfl intrinsics to the rocm bits (using tvm.intrin.rule.rocm.tvm_warp_shuffle /-up/-down definitions).
I'll probably run into the nvptx bits in the llvm codegen. Is there a reason not to use the intrin.rule mechanism for nvptx?
I'm not sure running gpu_imagenet_bench.py (which I'm using as the first stop of seeing if anything works) with the nvptx target works for me (though I get to the codegen for that), but I would not know if it worked before...

t-vi added a commit to t-vi/tvm that referenced this pull request Jun 4, 2020
See discussion in apache#5600.

I'm also throwing in a pointer lifetime fix for the context held by
NVPTX because otherwise topi/tests/python/test_topi_softmax.py
would segfault for me. With the test, I can also run resnet-18 on
the nvptx target in gpu_imagenet_bench.py.
t-vi (Contributor) commented Jun 4, 2020

@wpan11nv Thanks for your offer to help. I submitted the clean-up #5726 and then in #5727 I add ROCm warp reductions. One of the things I did was to avoid assuming a fixed warp-size of 32 in the TIR transformations before codegen.
Thank you for improving softmax btw - it was something that looked funny with the four kernels before.
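The warp-size point above can be made concrete with a small helper (a generic sketch, not TVM code): the shuffle offsets of a tree reduction depend on the warp/wavefront width, which is 32 on NVIDIA hardware but 64 on AMD GCN wavefronts.

```python
import math

def reduction_offsets(warp_size):
    """Shuffle-down offsets for a tree reduction across one warp or
    wavefront. Assumes warp_size is a power of two."""
    assert warp_size > 0 and warp_size & (warp_size - 1) == 0
    # Halve the offset each step: warp_size/2, warp_size/4, ..., 1
    return [warp_size >> (i + 1) for i in range(int(math.log2(warp_size)))]

print(reduction_offsets(32))   # [16, 8, 4, 2, 1]
print(reduction_offsets(64))   # [32, 16, 8, 4, 2, 1]
```

Hard-coding 32 in the TIR transformations would make the 64-lane case reduce only half of each wavefront, which is why the fix parameterizes the warp size.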

tqchen pushed a commit that referenced this pull request Jun 4, 2020
…tx (#5726)

See discussion in #5600.

I'm also throwing in a pointer lifetime fix for the context held by
NVPTX because otherwise topi/tests/python/test_topi_softmax.py
would segfault for me. With the test, I can also run resnet-18 on
the nvptx target in gpu_imagenet_bench.py.
tqchen (Member) commented Jun 5, 2020

cc @icemelon9 it might be useful to revisit the softmax strategy, given that the perf has been improved

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Jun 9, 2020
- Do not use multiple kernels

- Schedule with warp reductions

- Fixed a bug on the lower warp memory pass

- Fixed warp shuffle intrinsics for the nvptx backend.

Signed-off-by: Wei Pan <weip@nvidia.com>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Jun 9, 2020
…tx (apache#5726)

See discussion in apache#5600.

I'm also throwing in a pointer lifetime fix for the context held by
NVPTX because otherwise topi/tests/python/test_topi_softmax.py
would segfault for me. With the test, I can also run resnet-18 on
the nvptx target in gpu_imagenet_bench.py.
icemelon (Member)

That's great. I'll take a look at it.

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Jun 16, 2020
* [TFLITE]Select op support for tflite frontend (#5486)

* [TFLITE]Select/Where op support for tflite frontend

* Review comment fixed

* Review comment fixed

* [FRONTEND][TFLite] Fully connected op conversion made in sync with TFLite (#5510)

* [FRONTEND][TFLite] Fully connected op conversion made in sync with TFLite

* [1] Test case added

* [2] Review comments handled

* [3] Prints removed

* [TOPI][Winograd] Optimization of Conv2d Winograd algorithm on Tensor Core (#5485)

* Cache PrimExpr instead of raw pointers in bound analyzer (#5533)

The objects that the raw pointers point to can be deallocated and new
objects can be allocated at the same address, all while these pointers
are still in the cache. This can lead to unexpected behavior, for
example, calculated bounds conflicting with previously cached values.

Caching PrimExpr will prevent the objects from being deallocated while
the cache is active.
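The fix described in this commit message can be sketched in Python terms (hypothetical names, not the actual bound-analyzer code): key the cache on the expression object itself so the cache holds a strong reference, rather than on its raw address, which can be reused after deallocation.

```python
class BoundCache:
    """Sketch: caching by object (strong reference) instead of by address.
    While an entry exists, its key object cannot be freed, so its
    identity can never be silently reused by a different expression."""

    def __init__(self):
        self._cache = {}  # expr -> computed bound; keys stay alive

    def get_or_compute(self, expr, compute):
        if expr not in self._cache:
            self._cache[expr] = compute(expr)
        return self._cache[expr]
```

Keying on `id(expr)` (the Python analogue of a raw pointer) would leave stale entries behind once `expr` is garbage-collected and its address is recycled.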

* fix a few bugs with shape inference and types in the onnx importer (#5534)

* [Frontend][TFLite] ADD_N operator  (#5474)

* [WEB][RUNTIME] TVM WebAssembly JS Runtime (#5506)

* [WEB] Remove the old web runtime

* [WEB][RUNTIME] TVM WebAssembly Runtime

This PR introduces a brand new TVM web runtime based on the WASM standard API.
Main highlights:

- The new runtime is rewritten in TypeScript.
- The new runtime now directly interfaces with WebAssembly's standard API,
  instead of relying on emscripten's API.
  This change will make the js runtime more portable to runtime variants.
  For example, we could also try to make it interface with tvm's rust runtime implementation.
- System library can be provided through WASI
  - We also build a hack to enable Emscripten to generate a WASI-like
    bundle for the runtime environment on the Web.
- The wasm generation now uses the mainline LLVM.
- Dynamic linking (dlopen) is not used due to limitations of wasm;
  instead we rely on the recent RPC refactor to directly
  restart a new session for each wasm binary sent to the RPC.

* Address review comments

* Skip tensorcore test

* [RELAY][ONNX]ReduceLogSumExp Operator support (#5453)

* [RELAY]LogSumExp Op Support

* [ONNX]LogSumExp Op Support

* [RPC][BUGFIX] Fix remote device sync (#5538)

* [Refactor][std::string --> String] IRModule is updated with String (#5523)

* [std::string --> String] IRModule is updated with String

* [1] Packedfunction updated

* [2] Lint error fixed

* [3] Remove std::string variant

* [RUNTIME] Store nullptr PackedFunc as nullptr for better error propagation (#5540)

* [Relay-TFLite] FP32 and Quantized Object Detection Model (#5479)

* TFlite e2e FP32 Object detection model

* Fix test

* [Relay-TFLite] Quantized activations

* Flexbuffer parsing

* Lint

* Relaxing checks.

* Github reviews

* comments

Co-authored-by: Ubuntu <ubuntu@ip-172-31-34-212.us-west-2.compute.internal>

* Changes to cpp_rpc to make it work on Android (+ Hexagon offloading) (#5535)

* Changes to cpp_rpc to make it work on Android (+ Hexagon offloading)

- Implement getNextString to break up std::string into words. stringstream
  just doesn't work on Android.
- string::find_last_of doesn't look for the last substring, but the
  last character from a given string.
- Use SIGTERM to terminate processes (this isn't necessary, but using
  SIGKILL is not a good practice).
- Convert "./rpc" to a full path. When a module is uploaded and offloaded
  to Hexagon, the dlopen on Hexagon needs an absolute path (or a path
  without directories).

* Only set the absolute path on non-Windows platforms

Windows has different macros for the maximum path length.

* Add Onnx Pad v11 (#5539)

* fix restructured text (#5541)

* [CRT]fix to reduce RAM size during loading model (#5507)

* [CRT]fix to reduce RAM size during loading model

* Release graph_json memory immediately after reading

* Load platform specific lib for tvmdsoop instead of only so (#5542)

* [RPC] Improve RPCServer AsyncIO support. (#5544)

* [RPC] Improve RPCServer AsyncIO support.

When the RPCServer is in the async IO mode, it is possible for the server
to directly serve async function that may return its value via a callback in the future.
This mode is particularly useful in the web environment, where blocking is not an option.

This PR introduces the Async support to the RPCSession, allowing the AsyncIO driven servers
to serve the async functions. These functions will still be presented as synchronized version
on the client side.

Followup PR will refactor the web runtime to make use of this feature.

* Address comments

* [Rust] Add first stage of updating and rewriting Rust bindings. (#5526)

* Add tvm-sys

* Use as_mut_ptr

* Address CR feedback

* Update rust/tvm-sys/src/datatype.rs

Co-authored-by: Nick Hynes <nhynes@berkeley.edu>

* Final CR comments

* Fix find and replace error in frontend

Co-authored-by: Nick Hynes <nhynes@berkeley.edu>

* [TE] Fix MakeLoopNest for warp memory (#5382)

* [TIR][Printer] text format printer considering future parsing use (#5483)

* [Optimization] Warp level reduction support for CUDA (#5498)

- Added the warp level reduction support

- Upgraded shfl intrinsics to the sync version.

- This is the building block for scheduling softmax like operations.

Signed-off-by: Wei Pan <weip@nvidia.com>

* A clone of test/python/unittest/test_runtime_micro.py, however (#5546)

modified to run specifically on ARM cortex-M hardware, which
currently is just the STM32F746 discovery board.

Signed-off-by: Tom Gall <tom.gall@linaro.org>

* [CI] Install wasmtime for WebAssembly tests (#5494)

* Apparently, ONNX Conv with no 'pads' defaults to zero padding (#5548)

* [WEB] WebGPU support (#5545)

This PR introduces WebGPU support to tvm.
The WebGPU runtime is directly built in javascript(as WebGPU uses JS as the first class citizen API)
and exposes back to the tvm's runtime via PackedFuncs.

One important note is that ctx.sync is now async.
This is due to the fact that WebGPU is a purely async API and we cannot block in the web environment.

So the current best way to use the js api is to wrap things in an async function.
When copying a GPU array to CPU, await ctx.sync() needs to be called to wait for copy completion.

We use an AsyncIO rpc server to serve the async functions to the clients.

* [TOPI][RELAY][TENSORFLOW]Math ops added (#5502)

* [TOPI][RELAY][TENSORFLOW]Math ops added

* Extra newline removed

* CI fix

* Review comments fixed

* Review comments fixed

* [RUNTIME] Hexagon driver for offloading kernels to simulator (#5492)

* [RUNTIME] Hexagon driver for offloading kernels to simulator

* Add sim_dev as external project when building with Hexagon/sim support

* Change target CPU for sim_dev to v60

* [LINT] clang-format the h,cc,m files. (#5557)

This PR prepares for our migration to use the clang-format
as part of the linter system.

* [BYOC, MergeComposite] Add additional check before re-using the cached match (#5552)

* Add additional check before re-using the cached match in merge composite

* clean up ExtractPattern calls

* [WEB] Setup lint, doc, test (#5556)

* [CI] Update ci-cpu to bionic (#5555)

* [CI] Update ci-cpu to bionic (#5554)

* [Fix] Fix conv2d alter op for arm cpu (#5532)

* [FRONTEND]onnx, mxnet, pytorch mathops added (#5561)

* Fix topi test for tensorcore (#5563)

* [Refactor][std::string --> String] IR is updated with String (#5547)

* [std::string --> String] GlobalTypeVar is updated with String

* [std::string --> String] GlobalVar is updated with String

* [std::string --> String][IR] ADT is updated with String

* [std::string --> String][IR] OP is updated with String

* [std::string --> String][IR] Attrs is updated with String input

* [std::string --> String][IR] GlobalVar is updated with String

* [std::string --> String][Test] Pyconverter is updated with String change

* [DOCKER] Fix vulkansdk in the ci-gpu (#5566)

* [CI] reintroduce docker stage for wasm tests (#5565)

* [DOCKER] Introduce ci-wasm

* Add Jenkinsfile

* Rename prepare to prepwasm so it won't run by default

* [CI] Update ci-lint to use the latest image that contains clang-format (#5568)

* [DOCKER] Add clang-format and nodejs to ci-lint (#5567)

* [TARGET] Phase out WebGL (#5570)

The graphics API landscape is moving towards the next generation:
Vulkan/Metal on native platforms and WebGPU on the web.

Due to its limited programming model, we cannot get the best compute performance in WebGL.
Now that the mainline already has both WebGPU and Vulkan support, this PR phases out WebGL.

* [LINT] Enable clang-format. (#5572)

* [LINT] Enable clang-format.

* Add more docs

* [CI] Update the ci-gpu to the lastest build with the new vulkansdk. (#5571)

* [Relay] enable blocking format in x86 conv2d and fold scale axis (#5357)

* [CI] Fix clang-format error (#5577)

* Allow ubuntu_install_darknet.sh to work in both 18.04 and 16.04 (#5574)

* [PYTORCH]expand bug fix (#5576)

* [CI] Enable llvm-11 and llvm-10 in build tests, recover webdocs. (#5579)

This PR ties up the last loosen end of the recent CI update.

* [PYTORCH] Support max_pool2d_with_indices (#5549)

* Use real output name instead of node_name

* Add pytorch max_pool2d_with_indices converter.

* Add test for maxpool2d with indices

* Add explicit assert for single output

* Only consume output (not indices) from max pool 2d with indices

* undo change

* [Relay] Fixed bug in attribute parsing for pool layers. (#5582)

* Fixed pooling bug.

* Added tests and fixed more cases.

* [RELAY][TF] Support symbolic newshape for Reshape (#5429)

* [RELAY][TF] Support symbolic newshape for Reshape

* Only need to pass data

* Use MakeReshape() in Reshape()

* Change newshape to Expr

* Create a template for Array<T>

* Fuse reshape when newshape is constant

* Make newshape Optional

* Use bool() of Optional

Co-authored-by: Li Xiaoquan <xiaoquan.li@denglin.ai>

* Add prim::device op (#5584)

* Fix the runtime raise error (#5586)

* [RELAY][Convert Layout] Specify additional layouts in convert layout pass (#5422)

* [RELAY] Specify additional layouts in convert layout pass

* This patch means that you can specify an additional layout, rather than using the layout chosen by default during conversion.
* This is specifically useful for external codegen when a 3rd party library needs to target a specific kernel layout for example.

Change-Id: I3ef9cf45ead574801870a38af9768f93e29aab10

* Use mapping of op name to list of desired layouts

Change-Id: Ibd691a3cb93e73a394f36112668ad52a84c7d5a2

* Fix issue with code block

Change-Id: Ibb4e38c05ad4312b7dea845be699b8d5d57e0a94

* Address comments, Improve tutorial

Change-Id: Ib824eead329d551c338234de3b2d814693afd0ec

* Fix linting

Change-Id: Ie9e1891f590b3a7496a56ff8362cdda9d4b5fa75

* Test uses NCHW default layout. Unrelated issue with NHWC.

Change-Id: I1c16f0db73db56f5e9536db3fe5eb2624c3b595c

* Fix mistake in tutorial

Change-Id: I944041245d27af262dc96f1cd8117f1f19272062

* Address multiple comments

Change-Id: If33a1e34acd8fc37d1c7797ee189a6448a392672

* Improve tutorial

Change-Id: Ib04142c94c7958ab5067947d2ff4c84354e3d0c5

* Fix Clang-format

Change-Id: Ieff39e3f0817d22579c68b3287e972a3b0fcfbc8

* Add a quantized conv2 unit test for the tflite front-end (#5558)

Signed-off-by: Giuseppe Rossini <giuseppe.rossini@arm.com>

* [Relay][Transform] Safe check added for Merge Composite (#5562)

* [MXNET]abs, round, reciprocal, sign, softsign, hard_sigmoid (#5587)

* [Hexagon] One more fix for concurrency count (#5589)

* Fix JSON graph dumping. (#5591)

* Previously this function placed a JSON-escaped string containing
   the JSON-encoded graph.

* [DOCS] Improve document in reflection (#5593)

* Overestimate binary size for microTVM compiled binaries. (#5590)

* Overestimate binary size for microTVM compiled binaries.

 * Currently uTVM binary section sizes are computed by summing the
   sizes of all symbols in the section.
 * This method produces errors because it presumes the linker works in
   a particular way, rather than analyzing the linked output.
 * As we intend to move away from linking inside TVM (RFC
   forthcoming), just using this stopgap to make forward progress
   until then.

* address weberlo comments

* fix regression (use 64 bit word size)

* [TFLite Runtime] Fix bug and re-enable RPC execution test (#5436)

* [Relay][VM] Memory planner (part 1) (#5144)

* Start on memory planning

WIP

Move to test_memory_passes.py

Work on memory planning

Post-rebase and VM changes

Plumb through the offsets

Basic tests all pass, fix offset to data buffer.

Fix compile errors

Fix ws

Apply suggestions from code review

Co-Authored-By: Haichen Shen <shenhaichen@gmail.com>

Address CR

Update src/runtime/vm/vm.cc

Co-Authored-By: Haichen Shen <shenhaichen@gmail.com>

Fix another comment

Fix lint

Fix

Fix

Fix

Lint is done?

Fix

More fix

Trying to debug

No clue

Fix lint

* Fix docs

* Disable aggressive constant eval

* It works

* Fix lint

* Found issue with dynamic

* Fix the pass, but runtime segfaults

* fix scalar tensor, test_any_elemwise passes

* Fix split pass

* Fix 0-rank issues

* Fix

* debug

* apply Haichen's patch and clean up

* lintgit add .

* fix serializer and test_tyck_alloc_tensor test

* Fix the constant lift pass in presence of closures

* Restore old finder

* Fix rebase issues

* Fix

* Fix

* Fix issue coercing the shapes incorrectly from i64 to i32

* Fix linting

* Fix clang format

* Format memory.cc

* Fix 0-rank case

* Add fix for (0,) shape

* Ignore shapes for now

* Apply suggestions from code review

Co-authored-by: Zhi <5145158+zhiics@users.noreply.github.com>

* Update src/runtime/vm/executable.cc

Co-authored-by: Zhi <5145158+zhiics@users.noreply.github.com>

* Fix

* lint

Co-authored-by: Zhi Chen <chzhi@amazon.com>
Co-authored-by: Zhi <5145158+zhiics@users.noreply.github.com>

* Add ostream formatters for TargetPtr/TargetVal. (#5592)

* Pattern Language, Matcher, Rewriter, and Function Partitioner (#5231)

* [Reduction] Fix cross thread reduction (#5551)

- The predicates were not correctly applied after transformation.
  This leads to a normal reduction itervar appearing outside of the loop,
  which is undefined. See detailed comments.

Signed-off-by: Wei Pan <weip@nvidia.com>

* Fix TVMArray layout on device (#5599)

* [LLVM] Represent alignment information in LLVM IR (#5598)

* Add debug mode to tempdir() (#5581)

* [PYTORCH]ImplicitTensorToNum support added (#5603)

* [PYTORCH]Matmul fix for batch_matmul (#5604)

* fix rpc server bug on VTA (#5607)

* [REFACTOR][IR] Streamline ir/op Registry (#5609)

* [REFACTOR][IR] Streamline ir/op Registry

This PR refactors the attrregistry mechanism in the ir/op into
a separate common base. The common base will provide a foundation
for other attr related registries such as target and pass.

We also streamlines the terminology of the registry API.

- Use AttrMap for the column maps returned by the registry
- Use RegEntry to refer to the registry entry.

* Address review comments

* [TFLITE]GATHER_ND (#5508)

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* [CUDA] Fix codegen for warp shuffle intrinsics (#5606)

* fix shfl intrin

* improve test_lower_warp_memory_cuda_half_a_warp

* Fix a typo. (#5611)

Co-authored-by: Zeng Liyong <liyong.zeng@streamcomputing.com>

* fix pattern topological order (#5612)

* [BYOC] Remove kCompiler attr from external functions (#5615)

Functions destined for external codegen keep their kCompiler attribute, which means SkipFunction returns true when running a pass over such functions during the codegen step. This makes sense during graph partitioning; however, when lowering the functions for codegen there is no reason to keep this behaviour.

Allowing this behaviour will mean a codegen can run a pass on functions only intended for one 3rd party library. Specifically, it allows pre-processing of a series of sub-graphs right before they pass through codegen. This helps ensure that the functions destined for the 3rd party library are in the expected format. For example, we may want to ensure that these functions have a kernel layout of OHWI because the 3rd party library only supports OHWI. This wouldn't be possible before partitioning the graph, as we don't know how the graph will be partitioned ahead of time.

Change-Id: Ia68b9da335ef1acfc405a8528aac823de60a65c2

* [Relay]Improve Shape Func handling for Tuple inputs (#5467)

* Improve Shape Func handling for Tuple inputs

* Fix lint

* Improve

* Fix build

* [Relay][Refactor][std::string --> String] Relay updated with String (#5578)

* [KERAS]Global MaxPool3d and AvgPool3d support (#5098)

* [IOS] Fix build error of iOS RPC (#5621)

* [IOS] Fix build error of iOS RPC

- Update to C++14
- Use the latest RPC protocol
- Resolve CoreML dependency

* Fix clang-format error

* Fix three typos (#5620)

Co-authored-by: Zeng Liyong <liyong.zeng@streamcomputing.com>

* [Frontend][Tensorflow] Gather nd bug fix for one dim support in tensorflow (#5588)

* [Frontend][Tensorflow] Gather_nd one dim support added

* Test case added

* Doc error handled

* Review comment handled: reverting new attr introduced

* Check added at mxnet frontend

* Doc error handled

* TFLite test case failure resolved

* [MXNET]MaxPool3d and AvgPool3d Ops support added (#5614)

* [PYTORCH]ReflectionPad2d op (#5624)

* [BYOC][MergeComposite] if root->args[i] isn't a CallNode, then Downcast<Call> will fail its check (#5623)

we needn't execute L131 "call_map->Set(arg, new_arg)", because when arg
is a CallNode and root->args[i] is not a CallNode, new_arg will be a null
pointer. There is no point in caching a null pointer.

Signed-off-by: windclarion <windclarion@gmail.com>

* [DOCS] Move the api docs to the api subfolder (#5626)

* [DOCS] Move the api docs to the api subfolder

* Update numpydoc location

* Ignore 403

* make sure folder exists

* [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph (#5616)

* [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph

If the annotated compiler region contains multiple outputs where
some of the outputs are tuple output, the current PartitionGraph will
create tuple of tuples. This will not be handled by the runtime.
This commit flattens such tuples and re-creates them after the
call site of the partitioned function.

Change-Id: I4e7ccbda73c129a9f4ae8705d5c9f2af6ab99ef6

* [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph

    *code refactor : extracted the passes as a sequential

Change-Id: If4bc00b00a96fa244358d602fc1a361498342f46

* [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph
   *further refactor

Change-Id: I69ddd0e835e88ef97da8a3a3b949be3f7b619c02

* [RELAY][BYOC] Fix the creation of tuple of tuples in PartitionGraph
    *class description comment amended

Change-Id: I55720bf0467c96e979e1ab56c40d9d209e0f9456
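The flatten-and-rebuild idea described in this commit message can be sketched generically (plain Python with hypothetical helpers, not the actual PartitionGraph code): nested tuple outputs are flattened into one flat tuple at the function boundary, and the original nesting is rebuilt at the call site.

```python
def flatten(outputs):
    """Flatten a mixed list of values and tuples; also return the
    structure needed to undo the flattening."""
    flat, structure = [], []
    for out in outputs:
        if isinstance(out, tuple):
            structure.append(len(out))  # remember tuple arity
            flat.extend(out)
        else:
            structure.append(None)      # plain (non-tuple) output
            flat.append(out)
    return flat, structure

def rebuild(flat, structure):
    """Re-create the original nesting from the flat list."""
    result, i = [], 0
    for s in structure:
        if s is None:
            result.append(flat[i]); i += 1
        else:
            result.append(tuple(flat[i:i + s])); i += s
    return result
```

The runtime only ever sees the flat tuple; callers get the nested shape back via the rebuild step.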

* [NODE][PASS] Introduce config to PassContext. (#5631)

This PR introduces a new config field to the PassContext
to allow it to store arbitrary config values.

To make sure that the config is validated, we allow each pass
to register the config key they would expect and the corresponding types.

We also introduce a CreateObject from Map<str, Object> to allow config creation
from a JSON-style nest (like in vscode) in Python.

We added an example of UnrollLoopConfig.

Followup PR should migrate the passes to use the new config field.

* another cmake fix (#5630)

* Fix typo in test script (#5635)

* Label Pattern Partitions (#5627)

* Label Pattern Partitions with a default label to prevent nested partitions and an optional user-supplied label

* Add node names in topological order to Partitioned attribute

* respond to review comments

* move partition tag into const in attr namespace

* [RELAY][PYTORCH]Resize3d, Upsample3d op support (#5633)

* [TUTORIAL]TFLite QNN Tutorial (#5595)

* [TUTORIAL]TFLite QNN Tutorial

* Review comments

* Extend AttrPattern to support CallNode and FunctionNode attributes (#5637)

* Extend AttrPattern to support CallNode and FunctionNode attributes

* Update tutorial and add breaks

* add func attr test

* [DOCS] Fix the QNN TFLite tutorial build (#5641)

* [TUTORIAL] Fix execution error of TFLite quantized tutorial

* Assign TensorCore to docs build

* [RUNTIME][VULKAN] Seg fault in WorkspacePool's destructor (#5632) (#5636)

* [RUNTIME][VULKAN] Seg fault in WorkspacePool's destructor (#5632)
* fixed this issue by changing WorkspacePool's destruction order

* make lines < 100 characters long

* [PYTORCH]Padding support (#5638)

* Remove unnecessary print (#5642)

* [CI] Allow CI_PYTEST_ADD_OPTIONS to be unbound. (#5644)

This patch allows the test script to execute normally
when CI_PYTEST_ADD_OPTIONS is not available.

* [Runtime] Introduce runtime::Array (#5585)

* Introduce runtime::Array

* Sync with dmlc-core

* Tests added: size, capacity, empty, front, back, push_back, pop_back, insert * 2, erase * 2, resize, reserve, clear

* [CI] Add log check to the sphinx gallery docs (#5643)

* [CI] Add log check to the sphinx gallery docs

This PR adds a log check to sphinx gallery tutorials to prevent
the case where sphinx fails to capture an error in tutorials.

* Fix the status

* [RELAY][BYOC] Preserve type information in Merge Composite (#5640)

Keep the type information when extracting patterns
so that it can be used as part of 'check' functions.

Change-Id: I16cc70c3d013a794d2ceefb5bec815129c7b8825

* Add a check Callback to the Pattern Paritioner (#5646)

* add a check callback to the paritioner

* fix doc string

* fix unit test spelling

* add a test with types

* [Relay, Topi][OP] Correlation (#5628)

* [Relay,Topi] Correlation

* fix

* move

* typo

* Update test_topi_correlation.py

* HG: Commit message of changeset 6281661. (#5622)

[Relay] Move compiler_begin/end_op to local static objects

* [AutoTVM] Update XGBoost verbosity option (#5649)

* [RUNTIME] Resolve constexpr issue in debug mode. (#5651)

static constexpr is a bit weird before c++17.
Such variables are not inlined by default and do not have symbols after compilation.
This usually isn't a problem when they are inlined (in c++17 they are inlined by default),
but it creates compilation errors when they are passed to functions that take (const) references.
This PR fixes the problem so that we can compile in debug mode.

* µtvm debug improvements (#5648)

* Forever loop in UTVMDone to aid debugging

* Use parameter and callback function as a micro debug hook.

 * Previously, users had to uncomment a region of code in
   micro_session.cc and recompile to debug. Now they can pass in a
   key in the micro.Session config:

       config = tvm.micro.device....generate_config()
       config['debug_func'] = _python_launch_gdb
       with micro.Session(config) as sess:
         ....

* clang-format

* Only forever loop on device (on host this blocks unittests)

* [REFACTOR][IR] Migrate IRModule ObjectRef to not-null (#5654)

* Upgrade XGBoost to latest (#5658)

* Increase bss section size. (#5660)

* Likely broken in PR 5590.

* [PatternLang] Convert PatternGrouper to do pre-order, non-recursive analysis (#5653)

* make the PatternGrouper iterate over the input Expr in a non-recursive pre-order fashion

* add a comment
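The non-recursive pre-order iteration mentioned above can be sketched with an explicit stack (a generic Python sketch, not the PatternGrouper itself); replacing recursion with a stack avoids call-stack overflow on deep expression trees.

```python
def preorder(node, children):
    """Yield nodes in pre-order; children(n) returns n's child list.
    An explicit stack replaces recursion."""
    stack = [node]
    while stack:
        n = stack.pop()
        yield n
        # push children reversed so the leftmost child is visited first
        for c in reversed(children(n)):
            stack.append(c)
```

For an expression modeled as (name, children) pairs, this visits a parent before any of its children, matching the pre-order contract the pass relies on.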

* [Relay,Topi][OP] affine_grid and grid_sample (#5657)

* [Relay,Topi][OP] affine_grid and grid_sample

* lint

* [TIR][BUILD] Remove buffer params from pass config. (#5652)

Buffer configurations can be passed during construction
and do not need to be part of the build config.

This is a refactor step to simplify the BuildConfig for the PassContext migration.

* handle likely in IRMutatorWithAnalyzer (#5665)

* [TOPI] Improve CUDA softmax scheduling (#5600)

- Do not use multiple kernels

- Schedule with warp reductions

- Fixed a bug on the lower warp memory pass

- Fixed warp shuffle intrinsics for the nvptx backend.

Signed-off-by: Wei Pan <weip@nvidia.com>

* [Relay][Op]Support symbolic TopK, Ones, Zeros and Full (#5459)

* Support symbolic TopK, Ones, Zeros and Full

* Fix pylint

* Add docstring for topk shape func

* Fix grad

* Fix lazy_gradient_init

* Fix parser

* Fix print ir text

* Fix lint

* Improve pattern_util

* Fix topk

* Fix build

* Use Optional for attribute

* Fix clang-format

* Minor fix

* Fix pylint

* Fix build warning

* Fix parser

* Move ToScalar

* Fix lint

* Fix lint

* Make topk shape func data independent when k is a constant.

* Fix lint

* Minor fix

* [PYTHON] Add buffer name when creating tensor bindings (#5670)

* [REFACTOR][TIR][API-Change] Migrate BuildConfig to PassContext. (#5668)

* [REFACTOR][TIR] Migrate BuildConfig to PassContext.

This PR migrates the TIR configurations from BuildConfig to the
PassContext used by the unified IR.
Moving forward, PassContext will be the unified way to configure passes in the TVM stack.

Changes

- Refactored TVM_PASS_REGISTER_CONFIG_OPTION to take in the reference type.
- Removed BuildConfig.
- Migrated the passes to use PassContext.

* Update include/tvm/ir/attrs.h

Co-authored-by: Zhi <5145158+zhiics@users.noreply.github.com>

Co-authored-by: Zhi <5145158+zhiics@users.noreply.github.com>

* [Doc] Misc doc fix (#5672)

* [C++ RPC] Fix C++ RPC build problem on Linux (#5671)

* enable amd_apu device on vulkan target (#5659)

* [AutoTVM][TOPI] AutoTVM incorrect measurement (#5511)

* [AutoTVM][TOPI] AutoTVM incorrect measurement

* create new placeholder with converted layout

* update _schedule_winograd

* [POC][PatternLang]Remove constants from partitioned functions (#5663)

* remove constants from partitioned functions

* remove print statements

* [TF] Support TupleWrapper as direct ancestor of control flow ops (#5639)

* add tvm.micro pydoc to sphinx (#5661)

* add tvm.micro pydoc to sphinx

* making build pass and addressing tqchen comments

* add a check for null function attributes (#5674)

* [BYOC] Pattern Language MergeComposite (#5656)

* Pattern Language MergeComposite

* fix DNNL pattern

* Use builtin binary operator syntax for demo

* Improve unit test

* add a testcase for #5674 (#5677)

* Call previous excepthook in tvm_excepthook. (#5675)

* Call previous excepthook in tvm_excepthook.

* Rename prev_excepthook.

* Create a tvm_wrap_excepthook to wrap a given excepthook with the TVM custom excepthook logic
and call it on the system's previous excepthook.

* Add docstring.

* Fix the shift column for scale_shift_nchw and scale_shift_nhwc in C topi (#5679)

* [Bugfix] Fix Python debugger segfaults with TVM built with LLVM (#5685)

* Import readline before loading libtvm

* make lint happy

* [DOC] Improve Pattern Language Docs (#5676)

* [DOC] Improve Pattern Language Docs

* address comments

* address comments

* [TFLITE]Quantize & Dequantize op (#5394)

* [TFLITE]Quantize & Dequantize op

* Testcases added

* Review comment fixed

* [TIR][REFACTOR] std::string -> String Migration in TIR nodes (#5596)

* [TIR][REFACTOR] std::string -> String Migration for Var node and SizeVar Node

* update json_compact.py

* [PatternLang] Add ConstantPattern (#5689)

* Add ConstantPattern

* update doc

* [PYTORCH]Minor bug fixes (#5683)

* [PYTORCH]Minor bug fixes

* Review comment fix, testcase added

* Added testcase for bert model

* [Relay] Fix dataflow_pattern.rewrite() hang if Match in IR (#5680)

rewrite() quits only when the graph stops changing, but ExprMutator
  always creates a new Match node. This patch fixes that.

* [RELAY] Fix segfault in pretty print when ObjectRef is null (#5681)

* [RELAY] Fix segfault in pretty print when ObjectRef is null

Encountered when pretty printing module with function attribute equal to NullValue<ObjectRef>().

Change-Id: I2e7b304859f03038730ba9c3b9db41ebd3e1fbb5

* Add test case

Change-Id: I579b20da3f5d49054823392be80aaf78a055f596

* [REFACTOR][RELAY] move fallback_device to config (#5690)

* @zhiics -> PPMC (#5692)

* [COMMUNITY] @masahi -> PPMC (#5691)

* Support more dtypes for TVMDSOOp (#5694)

* [ONNX]LpPool Support added (#5696)

* In memory_plan, check if value is not None, instead of just checking value as boolean. (#5700)

* [PatternLang]Conditionally Embedding Constants in Partitioned Functions (#5693)

* Embed constants in the partition function if the pattern explicitly requests constants

fix rst

fix pylint

* improve comments based on Cody's feedback

* [ONNX] Skip ADD inside Gemm op when vector is zero (#5697)

* [BYOC] Support Tuple Output in C/DNNL Codegen (#5701)

* Support tuple output runtime

* fix unit test

* [REFACTOR][RELAY] Replace build_config with PassContext (#5698)

* [PYTORCH]floor_divide support for squeezenet (#5702)

https://github.com/apache/incubator-tvm/issues/5133#issuecomment-636330705

* [AutoTVM][TOPI] Fix bifrost spatial packing conv2d auto tune (#5684)

* [AutoTVM][TOPI] Fix bifrost spatial packing conv2d auto tune

* [AutoTVM][TOPI] Putting placeholder replacement in compute

* Fix winograd kernel replacement

* Fix sanity check: Line too long

* [Arith] ExtendedEuclidean merge impl to int_operator (#5625)

* fix typo: anchor windoes should be anchor windows (#5706)

* [REFACTOR][PY] relay.op.Op -> tvm.ir.Op (#5705)

* [REFACTOR][PY] relay.op.Op -> tvm.ir.Op

* Improve the error check

* [PatternLang] Simplify Pattern API Implementations (#5703)

* Add syntatic sugar; include pattern to API docs

* fix doc warnings

* [PYTORCH]ReplicationPad support added (#5708)

* Remove deprecated opengl files (#5711)

* Remove opengl runtime and cmake (#5712)

* [BUGFIX][CRT] Fix Compilation Error in CRT (#5713)

* Rename tvm_dso_op to libtvm_dso_op (#5714)

* [Object] Unify StrMapNode and MapNode (#5687)

* Pass cpptest and py unittest

* fix graph runtime

* right fix

* fix a bug where runtime::String's operator < actually compares by address

* Update container.py

* Renaming

* Address comments

* lint

* Replace ObjectHash in object.py

* [MXNET]Softmin, trunc op support added (#5715)

* Avoid downloading when TOPHUB_LOCATION is NONE (#5720)

* [Object][FFI] Introduce runtime::String::CanConvertFrom (#5718)

* [Object][FFI] Introduce runtime::String::CanConvertFrom

* Update container.h

* [Object] Restore the StrMap behavior in JSON/SHash/SEqual (#5719)

* Fix generating types like float44 and float88 (#5722)

* [ONNX]ReduceL1, ReduceL2, ReduceSumSquare, ReduceLogSum ops added (#5721)

* [TENSORFLOW]StatefulPartitionedCall/PartitionedCall Ops support added  (#5617)

* Implemented functionInvocation unit test for the StatefulPartitionedCall operator (working) and initial changes for placeholder (not working as of now)

* Placeholder exercises with tvm

* placeholder interim

* SPOP Test cases structure

* New test cases for spop

* miscellaneous test cases for spop

* Placeholder samples..working with shapes explicitly passed

* Variables test case. Works with the same fix of shape_dict

* SPOP Positive test cases first iteration

* support output tensors as function args, multiple functions

* Corrected Indentation

* filewritter is only for debug purpose

* support variables in function args

* First working iteration of positive spop test cases

* Removed commented code, simplified code

* Code Reorganization- First working iteration of positive spop test cases

* corrected variable name after refactor

* Code Reorganization- First working iteration of positive spop test cases

* move code inside mapped operator function

* Removed extra line

* support variables in function args

* Removed commented code, simplified code

* move code inside mapped operator function

* Code Reorganization- First working iteration of positive spop test cases

# Conflicts:
#	tests/python/frontend/tensorflow/test_forward.py

* Code Reorganization- First working iteration of positive spop test cases

* Function invocation more test cases

* Simplified & Merged different Function Invocation Test cases

* support invocation of nested callables

no need to explicitly handle the Partitioned and
StatefulPartitioned conditions in the convert_operator function

* Simplified and Uniform testcases

* support invocation of nested callables

no need to explicitly handle the Partitioned and
StatefulPartitioned conditions in the convert_operator function

* Simplified and Uniform testcases

* removed duplicate and renamed testcase

* Negative scenario added for testing operator statefulness. The only exceptions among stateful operators are Partitioned & StatefulPartitionedOp, which have the capability to execute even stateless operators within them

* Miscellaneous reorganization changes for spop scenarios

* Miscellaneous reorganization changes for spop scenarios

* Corrected import of tensorflow modules safely using try except and other code reorganization

* Negative scenario for resource variables handled

* Documentation update for code

* SPOP change in function handling

* handle nested subgraph

* refactor

* get op def compatible with tf 1x & 2x

* Fixed liniting issues

* added doctsring and few nits

* Merged changes for positive test cases and negative test cases

* Moved StatefulPartitionedCall test case to the end of the TC list

* Fixed some typos and semantics

* dmlc-core

* dmlc-core

* fixes

* Addressing Review comments in the PR for SPOP support

* Fixed pylint errors

* Corrected tensorflow import syntax

* Placed the op_def_registry module import outside of for loop

* Removed the new stateful-operators list and combined these operators with the missing operators to display as a single list. Also removed throwing a separate exception for stateful ops

Co-authored-by: Prashant Sail <psail4444@gmail.com>
Co-authored-by: maheshambule <mahesh_ambule@persistent.com>

* [AutoTVM, Relay] Clear compile engine after task extraction (#5724)

* Fix runtime::String backward compatibility in JSON (#5725)

* codegen llvm: move nvptx-specific intrinsic handling into codegen_nvptx (#5726)

See discussion in #5600.

I'm also throwing in a pointer lifetime fix for the context held by
NVPTX because otherwise topi/tests/python/test_topi_softmax.py
would segfault for me. With the test, I can also run resnet-18 on
the nvptx target in gpu_imagenet_bench.py.

* [TOPI,RELAY][TFLITE] Sparse to dense operator (#5447)

* [Relay][Frontend][TFLite] Add parser support for shape and range

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* [TOPI,RELAY][TFLITE] Sparse to dense operator

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* use param name in documentation

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* sphinx doc errors fixed

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* incorporated review comments

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* Missing a blank line...

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* use get_tensor_expr

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* Accidentally removed this function in the rebase...

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* support default value for default_value

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* clang format fixes

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* topi pylint fixes

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* [Frontend][TFLite] Add parser support for shape and range (#5329)

* [Relay][Frontend][TFLite] Add parser support for shape and range

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* Incorporated review comments and used new functions

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* Few cosmetic changes

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* Removed an extra line added by rebase...

Signed-off-by: Dhruva Ray <dhruvaray@gmail.com>

* [REFACTOR] Separate ArgTypeCode from DLDataTypeCode (#5730)

We use a single enum (TypeCode) to represent both ArgTypeCode and DLDataTypeCode.
However, as we start to expand to more data types, it is clear that the argument
type code (in the FFI convention) and the data type code need to evolve separately,
so that we can add first-class data types without changing the FFI ABI.

This PR makes the distinction clear and refactored the code to separate the two.

- [PY] Separate ArgTypeCode from DataTypeCode
- [WEB] Separate ArgTypeCode from DataTypeCode
- [JAVA] Separate ArgTypeCode from DataTypeCode

* [ONNX]MaxRoiPool, Mod & Xor op support added (#5729)

* ROCm: Add warp shuffles and enable reductions (#5727)

Thank you @masahi and @wpan11nv for the feedback

* Change 'delete's in Relay VM Instruction dtor to 'delete[]'s (#5735)

* Fix reshape usage in ARM Winograd (#5732)

* [TEST] Fix flaky topi/tests/python/test_topi_pooling.py:test_adaptive_pool (#5736)

* Fix the values for test_fmod since it fails way too often otherwise (#5723)

* fix small bug about dense_grad (#5695)

* [REFACTOR][ARITH] Remove legacy compute_expr.h (#5738)

Replaces most of the ComputeReduce usage with foldl.

* Add some docs on downstream consistency (#5742)

https://github.com/apache/incubator-tvm/pull/5730#issuecomment-639567636

* sequential cpp test (#5745)

* [REFACTOR][TE][TIR] Call::Halide => ProducerLoad, DSL/TIR decouple. (#5743)

In the HalideIR's design, DSL components and IR are mixed together.
For example, Call::Halide can contain a reference to a function which is
constructed in the tensor expression language.

While this coupled design simplifies certain aspects of the DSL construction,
it prevents the TIR from evolving as a clean standalone IR:

- The additional tensor expression provided in the function is opaque to the IR
  and may become obsolete as we transform them.
- The duplication of the information in the DSL tensor and IR makes it hard to
  design a stand-alone text format (when there are elements shared in the tensor
  expression and normal statements).

This PR aims to clearly de-couple the TIR from high-level DSL structures (tensor expressions),
while still providing clear extensions to build DSLs on top of the TIR.

We introduce a DataProducer as a base class for high-level tensor expression objects
that produce data. We then introduce ProducerLoad to replace the Call::Halide usage,
so that the Call node can always be self-contained and used for low-level calls.

The high-level tensor expression DSL can still generate a PrimExpr that contains a ProducerLoad.
These PrimExprs contain fragments of information that can be combined together to
generate a low-level TIR PrimFunc.

We also state clearly that DataProducer **should not** appear in any TIR PrimFunc.
Instead, the high-level DSL layer should lower DataProducers to Buffers and TIR statements
that produce these buffers. We can further provide verifications to validate such invariants.

Changes:
- Introduce DataProducer to serve as a base class for Tensor in tensor expressions.
- Migrate use of Call::Halide to ProducerLoad
- Migrate the other usages of Calls.

We will also create follow-up PRs to migrate the remaining two DSL related IR nodes(Realize/Provide)
to use the DataProducer.

* Don't add cast for TF batch norm when type isn't changing (#5731)

* [ARITH][BACKPORT-0.6] fix a min/max simplify bug (#5749)

* fix a min/max simplify bug

* fix cpplint

* turn into the opposite when c1val < 0 and add more cases

* fix c1=0

Co-authored-by: xqdan <danxiaoqiang@huawei.com>

* [TOPI][Relay][OP] support dynamic NMS(Non Maximum Suppression), symbolic begin, end, and strides for strided_slice (#4312)

* [TOPI][Relay][OP] Dynamic NMS and strided_slice

* Incorporate comments

* fix nnvm compatibility issues

* fix InferCorrectLayout

* Minor fix

* fix for fuse

* Workaround to pass batch_size into hybrid function to handle dynamic shape

* Separate rearrange

* fix lint

* fix ci, comments

* change attr to Optional<T>

* clang format

* remove empty lines

* partial ignore for end of strided_slice

* pylint

* add out_indices for gpu get_valid_counts

* change to slice_mode

* clang-format, fix comments

* fix comment

* change slice_mode to string

* fix CI

* update docstring

Co-authored-by: Yao Wang <kevinthesunwy@gmail.com>

* Update dmlc_tvm_commit_id.txt

* Update TRT Integration to reflect upstream changes

* Sync submodules

* Fix jenkinsfile

* git-clang-format against origin/dev instead of origin/master

* Fix formatting.

* Remove is_empty in export_lib (used for old trt)

* Disable test_forward_qnn_mobilenet_v2_net

* Add Scatter to Topi/Relay/ONNX via hybrid script (#5619)

* I can construct scatter but not embed it in a Relay Graph

* working 1-4 dimesion scatter

* add scatter to ONNX

fix lint

* isolate tests to cpu backend

* Fix i386 test

* fix gpu tolerance

* use elemwise_shape_func for scatter

* fix incorrect rebase

* [Minor][Test] Clean WASM environment before build (#5759)

* [Bugfix] Fix reshape (#5739)

* Fix reshape

* fix doc warning

* fix ci

* address comments

* [REFACTOR][TIR] Provide->ProducerStore, Realize->ProducerRealize. (#5750)

This PR finishes the final step of the DSL/TIR de-coupling by refactoring
Provide/Realize to use the DataProducer.

As in the case of ProducerLoad, ProducerStore/Realize are not supposed
to appear in a valid TIR function and are only used by high-level DSLs
as intermediate structures.

* [Rust] Second stage of Rust Refactor (#5527)

* Add tvm-rt crate

* Backport changes from frontend branch

* Format

* Add ASF headers

* Address self-code review

* Replace with helper

* Fix lint

* Fix

* Clean up repro debugging

* WIP

* Remove global resgistry to fix one memory issue

* Fix

* Format

* Format

* Update rust/tvm-rt/README.md

Co-authored-by: Jason Knight <binarybana@gmail.com>

* Format

* Duplicate TVM macros

* Split macros

* Restore old macro for old crates

* Repair macros

* Fix format

* Format

Co-authored-by: Jason Knight <binarybana@gmail.com>

* [topi] block sparse dense on cuda (#5746)

* [Relay] Fix for recursive let (#5757)

* Make let processing iterative

* Try again

* Fix pretty printer overflow

* cleanup

* fix lint

* Fix text printer

Co-authored-by: Jared Roesch <roeschinc@gmail.com>
Co-authored-by: Jared Roesch <jroesch@octoml.ai>

* [TOPI][RELAY][PYTORCH]Conv3d_transpose op support added (#5737)

* [TOPI][RELAY][PYTORCH]Conv3d_transpose op support added

* Test cases in topi/relay

* conv3d_transpose_ncdhw_python added

* Review comments fixed

* Fix gelu in PyTorch frontend, tighten numerical checks (#5763)

Previously, the PyTorch frontend approximated gelu with fastgelu.
To provide a more faithful conversion, we implement gelu instead.

We also tighten the numerical comparisons between PyTorch and
TVM-from-PyTorch to 1e-5. The object detection models need an
increased tolerance of 1e-4 to pass.

I had to throw in a few fixes for missing conversions
(probably due to working with very new PyTorch).

I must admit the GoogLeNet/NasNet test didn't run on my machine,
probably due to problems at my end.

* Add ShapePattern and DataTypePattern (#5760)

* Make batch matrix multiplication on GPU tunable (#5752)

This is primarily aimed at the AMD GPU backend and done as part
of a project for AMD, but should work for all users of the GPU
schedule.

* [TIR][REFACTOR][API-Change] Migrate the tvm/tir/expr.h to construct style. (#5773)

This PR migrate tvm/tir/expr.h to the new constructor style that is
consistent with the rest of the codebase and changes the affected files accordingly.

* [TIR][REFACTOR][API-Change] Migrate tir/stmt.h to use constructor. (#5778)

This PR migrate tvm/tir/stmt.h to the new constructor style that is
consistent with the rest of the codebase and changes the affected files accordingly.

* [Frontend][TensorFlow] Improve Control Flow and TensorArray (#5699)

* Improve TF parser control flow and tensor array

* Fix tf tensor array scatter

* Add ssd test

* Add back static ta test

* Minor fix for frontend and test_forward

* SplitRel for dynamic shape

* Fix test ssd

* Fix loop var naming issue

* Minor improve

* Fix format

* Fix clang format

* Fix tensor array in pytorch frontend

* Fix stack size issue for ssd test

* Address comments

* Fix slice size

* Fix build

* Rebase

* [DOC][FIX] Fix some typos in git-clang-format.sh (#5786)

* fix #5686: remove an overstrict assert in MakeAllreduce (#5686) (#5785)

* [RUNTIME] Add compile_shared option to linux compile utility fn (#5751)

* feat: Add compile_shared option to linux compile fn

* feat: Add compile_shared option for linux compile util fn

* fix: Fix minrpc testcase use executable compilation

* fix: Fix binutil case where call create_shared to create executable

Co-authored-by: baoxinqi <baoxinqi@4paradigm.com>

* [REFACTOR][API-Change] Migrate all Object construction to constructor. (#5784)

This PR migrates all the remaining object constructions to the new constructor style
that is consistent with the rest of the codebase and changes the affected files accordingly.

Other changes:

- ThreadScope::make -> ThreadScope::Create
- StorageScope::make -> StorageScope::Create

* [Topi] pass-by-value -> pass-by-const-reference (#5783)

* [topi][relay] Add operation gather to relay. (#5716)

* [CODEGEN][CONTRIB] CoreML codegen (#5634)

* [CODEGEN][CONTRIB] CoreML codegen

* import coremltools only when it is necessary

* fix pylint errors

* don't import contrib.coreml when using runtime lib

* skip coreml codegen test in CI

* don't register relay.ext.coremlcompiler in __init__.py

* move tvm/contrib/coreml.py to tvm/contrib/target/coreml.py

* use existing transformers for graph partitioning

* skip test only when coremltools is not available

* add check for annotation

* move _register_coreml_op to python/tvm/relay/op/contrib/coreml.py

* skip compile when xcode is unavailable

* relay.op.Op -> tvm.ir.Op

* set USE_COREML on

* refine test

* fix calibration pass to support multiple functions (#5768)

Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-142.us-east-2.compute.internal>

* [cmake] update vulkan rules (#5777)

* Add ignore storage_order attribute to onnx pooling parser. (#5781)

* [BYOC][FIX] Infer types in MergeComposite (#5766)

If InferType isn't run between partitioning passes,
function calls are inserted which don't have a type.
This can result in failures for patterns which want
to check types.

This works around it simply by running InferType after
every partitioning.

Change-Id: Ie0887f0564a41eb0913bfe42a362e8effe9681b9

* [FRONTEND]Darknet support batch size for yolo (#5688)

Fix the issue reported in 
https://discuss.tvm.ai/t/yolov3-tiny-batch-input-test-failed/6796

* Update dmlc_tvm_commit_id.txt

* Skip tflite test_forward_mediapipe_hand_landmark

* Increase stack limit for failing tflite tests. Skip TF tests which require TF 1.x

* [PYTORCH]aten::norm support added (#5776)

* [TENSORFLOW]Conv3d Transpose OP added (#5775)

* [TENSORFLOW]Conv3d Transpose OP added

* Testcase updated, tf cpu supports only ndhwc

* [TF] Support symbolic inputs of Fill (#5762)

* [TF] Support symbolic inputs of Fill

* Rebase and simplify. Value has been converted to a constant if it is a
tf.Constant

* [COMMUNITY] @wpan11nv -> Reviewer (#5790)

* Edit onnx parser to infer values in post order (#5755)

* edit onnx parser to infer values in post order to speed up onnx imports with many calls to infer_value

* fix pylint

* [TIR][REFACTOR] Cleanup unused classes (#5789)

* Fix tf parser (#5794)

* support aten::type_as in the pytorch frontend (#5787)

* support aten::type_as in the pytorch frontend

* use _convert_data_type to convert torch type to tvm type and add more types in the type_as test

* [TIR][REFACTIR] Update TIR nodes std::string->String. (#5793)

This PR updates the remaining TIR nodes' members to use
String instead of std::string.

* [TEST] Temporary disable fp16 type_as test for PyTorch Frontend (#5799)

* [ONNX] Skip multiply with 1.0f constant for GEMM import (#5800)

* [ONNX] Skip ADD inside Gemm op when vector is zero

* [ONNX] Skip multiply with 1.0f constant for GEMM import

* [TIR][REFACTOR] Add tir prefix to type keys (#5802)

* [QUANTIZE] Add config switch for nn.dense layer type. (#5801)

* [topi] fix sparse dense schedule on cuda (#5803)

* Allow RPCWrappedFunc to rewrite runtime::String as std::string (#5796)

* [topi] fix strategy for sparse dense cuda (#5782)

* [CI] Move cpu-only frontend tests to a CPU stage (#5807)

* [MXNET]conv3d and conv3d_transpose added (#5814)

* Pin hand landmark network to version 0.7.4. (#5813)

* Versions above 0.7.4 are broken due to changes in the
   quantization operations in the model, which are currently
   not supported by TVM.

Fixes #5774.

* [CI] Limit number of threads in all jobs (#5815)

* Update dmlc_tvm_commit_id.txt

* Disable tensorflow.test_forward_sdd because stack limit of 100mb is exceeded by WellFormedChecker

Co-authored-by: Samuel <siju.samuel@huawei.com>
Co-authored-by: ANSHUMAN TRIPATHY <anshuman.t@huawei.com>
Co-authored-by: wsl-inspur <61525780+wsl-inspur@users.noreply.github.com>
Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>
Co-authored-by: Matthew Brookhart <mbrookhart@octoml.ai>
Co-authored-by: Mahesh Ambule <15611578+maheshambule@users.noreply.github.com>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-34-212.us-west-2.compute.internal>
Co-authored-by: Thierry Moreau <tmoreau@octoml.ai>
Co-authored-by: tobe <tobeg3oogle@gmail.com>
Co-authored-by: Jared Roesch <jroesch@octoml.ai>
Co-authored-by: Nick Hynes <nhynes@berkeley.edu>
Co-authored-by: Tang, Shizhi <rd0x01@gmail.com>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Wei Pan <60017475+wpan11nv@users.noreply.github.com>
Co-authored-by: Tom Gall <tom_gall@mac.com>
Co-authored-by: MORITA Kazutaka <morita.kazutaka@gmail.com>
Co-authored-by: masahi <masahi129@gmail.com>
Co-authored-by: Haichen Shen <shenhaichen@gmail.com>
Co-authored-by: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Co-authored-by: Menooker <Menooker@users.noreply.github.com>
Co-authored-by: Josh Fromm <jwfromm@uw.edu>
Co-authored-by: lixiaoquan <radioheads@163.com>
Co-authored-by: Li Xiaoquan <xiaoquan.li@denglin.ai>
Co-authored-by: Candy <1915998056@qq.com>
Co-authored-by: LiangLiu <liangliu@buaa.edu.cn>
Co-authored-by: lhutton1 <35535092+lhutton1@users.noreply.github.com>
Co-authored-by: Giuseppe Rossini <giuseros85@gmail.com>
Co-authored-by: Andrew Reusch <areusch@octoml.ai>
Co-authored-by: Liangfu Chen <liangfu.chen@icloud.com>
Co-authored-by: Michal Piszczek <imichaljp@gmail.com>
Co-authored-by: Zhi Chen <chzhi@amazon.com>
Co-authored-by: Zhi <5145158+zhiics@users.noreply.github.com>
Co-authored-by: Dhruva Ray <dhruvaray@gmail.com>
Co-authored-by: Liyong Zeng <littlefish0123@users.noreply.github.com>
Co-authored-by: Zeng Liyong <liyong.zeng@streamcomputing.com>
Co-authored-by: Yao Wang <kevinthesunwy@gmail.com>
Co-authored-by: windclarion <windclarion@gmail.com>
Co-authored-by: manupa-arm <61496855+manupa-arm@users.noreply.github.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Yi Wang <samwyi@yahoo.com>
Co-authored-by: Cody Yu <comaniac0422@gmail.com>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: mbaret <55580676+mbaret@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Zhao Wu <zhaowu@apache.org>
Co-authored-by: Mei Ye <meiandmimi@yahoo.com>
Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw>
Co-authored-by: notoraptor <notoraptor@users.noreply.github.com>
Co-authored-by: Balint Cristian <cristian.balint@gmail.com>
Co-authored-by: Rand Xie <randxiexyy29@gmail.com>
Co-authored-by: abergeron <abergeron@gmail.com>
Co-authored-by: Deepak <59532278+deepakbabel23@users.noreply.github.com>
Co-authored-by: Prashant Sail <psail4444@gmail.com>
Co-authored-by: maheshambule <mahesh_ambule@persistent.com>
Co-authored-by: Thomas Viehmann <tv.code@beamnet.de>
Co-authored-by: akosik-anyvision <58490408+akosik-anyvision@users.noreply.github.com>
Co-authored-by: handar423 <47707767+handar423@users.noreply.github.com>
Co-authored-by: xqdan <danxiaoqiang@126.com>
Co-authored-by: xqdan <danxiaoqiang@huawei.com>
Co-authored-by: Yong Wu <ywu118@alumni.jh.edu>
Co-authored-by: Jason Knight <binarybana@gmail.com>
Co-authored-by: Zijing Gu <jingjing_gu@live.com>
Co-authored-by: Jared Roesch <roeschinc@gmail.com>
Co-authored-by: majiang31312 <majiang31312@qq.com>
Co-authored-by: wrongtest <wrongtest0@gmail.com>
Co-authored-by: baoxinqi <baoxinqi@4paradigm.com>
Co-authored-by: Yi-Hsiang (Sean) Lai <seanlatias@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-142.us-east-2.compute.internal>
Co-authored-by: Bing Xu <antinucleon@gmail.com>
Co-authored-by: Leandro Nunes <leandro.nunes@arm.com>