Torchdynamo tuning script #9

…#12481) * trace.cc * add tests * remove assert * add proper test * lint * lint

…stores are not generated at LLVM level. This is a workaround for an instruction selection issue in current version of llvm for hexagon (apache#12471)

)

* [TVMScript] IRBuilder, IRBuilderFrame base class This PR introduces basic data structures of the generic IRBuilder across the codebase. IRBuilder is a general-purpose IRBuilder that can be used in TIR, Relax and any other vendor-specific dialects; IRBuilderFrame is where contextual information is stored in the IRBuilder. * fix linter * Update include/tvm/script/ir_builder/base.h Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* Auto-vectorization (fp16) for v68 * use tvm.testing.main in fp16 test of tanh_slice op

* add bfloat16 promotion for CallNode * add softmax to bfloat16 build test

Previously, `CMSISNNFlags` was derived using logic specific to the external code generator; this change converts the external code generator options into a `Target`.

…che#12474) * [Target] Only append default keys if target doesn't have any yet This allows target parsers to provide their own target keys. Without this change, the default keys would always be appended, which may or may not be desirable. * Add "cpu" to ARM CPU keys * Add "cpu" to the keys in the mprofile target parser * Restore the mprofile cpptest, since the "cpu" key is back * So the -device attribute is actually needed...

To figure out a user's association with the repo, this code previously searched the associations in the repo filtered by the relevant username. GitHub doesn't return only the exact match, though, so we instead have to collect many results and search through all of them. Co-authored-by: driazati <driazati@users.noreply.github.com>

…2484)

* add config space * lint * lint

* fix scatterND large shape problem * fix thread pool alloca * add scatternd unit test * update with comment * Empty Co-authored-by: wrongtest <wrongtest0@gmail.com>

Fix some typos in src/. Co-authored-by: driazati <driazati@users.noreply.github.com>

…apache#12497) * [Relay][Layout] FInferCorrectLayout for L2 norm layout change. * [Relay][Layout] Test for L2 norm layout transform. * [Relay][Layout] Re-edit test to add multi-dimensional axis list. * Fix cpplint errors * Use clang-format-10 rules. * replace uint with size_t.

…apache#12516)

Following apache#12197, this PR introduces `Schedule.show()`, which improves the user experience in the following two aspects:
- Python syntax highlighting
- Outputs a schedule function instead of standalone instructions so that it's easier to follow.

To demonstrate this change:
- Before `Schedule.show()` is introduced: <img width="555" alt="image" src="https://user-images.githubusercontent.com/22515877/185713487-03722566-1df7-45c7-a034-c1460d399681.png">
- After this change: <img width="583" alt="image" src="https://user-images.githubusercontent.com/22515877/185713564-c54f3a9d-cd52-4709-a8b8-d8a61361e611.png">

This PR migrates the existing MemoryDatabase, which is implemented in Python at the moment, to C++. The original intent of having an in-memory database that does not persist on disk was merely for testing, but as time went on, we found it useful in production workflows, and thus decided to migrate it to C++ for potentially better performance.

This PR: - Adds an entry point for the TVMScript Unified Printer - Adds a helper object class `RootNodeContainer` to provide an injection point for the actual printer implementation to add specialized logic on the root node to print. Tracking issue: apache#11912

) This PR adds boolean operators to OperationDoc. This is needed by the TIR expression printing because it has `tir::And` and `tir::Or`. Tracking issue: apache#11912

…e#12347) Removes support for driver stack versions older than 22.05 (semantic 3.0.1). Additionally, changes the integration to make version checks using semantic versioning rather than the previous year.month versioning method.

…#12489) * [TIR] Support AllocConstantNode in CreatePrimFunc * Handle AllocConstantNode in LeafBlockRemovalPlan * Properly handle AllocConstNode in BufferAllocationLocator * handle AllocateConst in EstimateFlops * remove NDArray printing * doc update * add test * cpplint * Removed dependency on link-params attribute from target * Restored NDArray printing to unbreak test

…ache#12532) In the TVM ONNX frontend, constants are folded by default, which causes `test_load_model__onnx` to fail because it is looking for "params" that were already converted into constants. This patch fixes the test to disable constant folding so that we can assert that "params" in the model are present as expected.

…orskip (apache#12528) `test_meta_schedule_integration_extract_from_bert_base` depends on the `transformers` package, which is not currently installed in our Docker images. When this test runs today, it fails with an ImportError. This patch makes the dependency explicit, so the test is skipped when the dependency is not installed. `test_meta_schedule_integration_extract_from_bert_base` is part of the integration tests, which currently run only on the AArch64 and CPU images (neither of which has torch installed in the live CI system at the moment), so this is another issue to be understood and fixed.
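
The idea of making an optional dependency explicit can be sketched with the standard library alone. This is a minimal illustration, not the mechanism used in the patch; `has_module` and `BertExtractionTest` are hypothetical names.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


class BertExtractionTest(unittest.TestCase):
    # The test is reported as skipped, not failed, when the package is absent.
    @unittest.skipUnless(has_module("transformers"), "requires the transformers package")
    def test_extract(self):
        import transformers  # noqa: F401  (only reached when the package exists)


print(has_module("json"))
```

With this shape, a missing package shows up in the test report as a skip with a reason string instead of an ImportError.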

* expose project_options in autotuning * address comment * address comment Co-authored-by: Mohamad <mkatanbaf@users.noreply.github.com>

…e_read (apache#12505) * Add optional consumer blocks to cache_read. * remove comments * Fully functional * Add test for consumer targetting. * Formatting. * Add missing parameter comment. * Fix comments * Simplify type of consumer_blocks in python. * Change how consumer_blocks is printed in python.

This adds a testing utility so we can mark parameter combinations as xfail without having to manually match each parameter from the name into the code. The param strings here come directly from CI logs as in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-12389/5/pipeline Co-authored-by: driazati <driazati@users.noreply.github.com>

) This PR intends to move the alexnet and googlenet caffe models from the old link to s3, therefore getting rid of the flakiness in `caffe/test_forward.py` introduced by external url timeouts. Fixes apache#12465

* [LLVM] Add "cl-opt" attribute to target_kind "llvm" Add LLVMTargetInfo class that can be used to query the LLVM configuration without forcing an LLVMTarget to be created. There is no programmatic way to obtain the actual type of an LLVM option. The type is necessary to obtain the value of the option, hence it must be provided as a part of the option string. See src/target/llvm/target_kind.cc for more information about the syntax. * Fix lowercasing of bool value string * Use std::optional instead of std::pair<..., bool> * Treat malformed options as fatal errors * Fix linter * More unit tests for option parsing, have one case per test * Remove "option ignored" from fatal error messages

There was a flaw in uma_lower (see issue apache#12410) that led, in some cases, to a different argument ordering of the cached_func and the Relay function. This results in an incorrect lowering of the primfunc and eventually, in some cases, a wrong result or a run-time error. This commit adds code to correct the described misbehavior and a unit test case to check this end-to-end functionality with a TFLITE model.

* [TIR] Add pass to check for out of bounds memory access This is a conservative static analysis that checks to see if any out of bounds array access occurs. It is not enabled by default. * formatting * manually construct local irmodule * update comment * fix bug in int_set

Co-authored-by: Mohamad <mkatanbaf@users.noreply.github.com>

…function (apache#12539) * Add workaround for apache#12538 * Rework evaluate_model_accuracy into predict_labels_aot

* Replace microTVM static fixtures with parameterization * [microTVM] Only perform parameterization when fixture is present * Reformat with black * Fix Cortex-M tests * Add docstring to pytest_generate_tests * Remove trailing space from docstring

This PR documents the steps to introducing a new CI docker image, which we've been doing a lot lately.

In Keras 2.7, one "reshape" operator was removed from the Mobilenet model, making our test, which verifies the number of operators, incorrect. This patch adjusts the operator count so that it is in line with the changes in Keras. For reference, the change in the keras repo was done in hash b6abfaed132 "Remove unnecessary reshape layer in MobileNet architecture".

Co-authored-by: Leandro Nunes <leanun01@e123855.arm.com>

This PR fixes the tests that would break when we update TFLite to version 2.9. We will update the TensorFlow and TFLite versions to 2.9 so that we can benefit from improvements in packaging to support multiple platforms and operating systems.

Pass that fuses nn.pad and qnn.conv2d for CMSIS-NN target.

Some integration tests are failing when running on CI machines that have torch installed (validated only on AArch64 for now), with an error message related to libgomp, similar to the one below: OSError: /.../dist-packages/torch/lib/libgomp-d22c30c5.so.1: cannot allocate memory in static TLS block As part of enabling the integration tests on AArch64, I'm marking these tests as skipped, so that the other tests can start executing and don't regress while we take time to investigate these specific failures.

Fixes a small issue when converting the output information to the support library API. The `requantize_info` output datatype needed updating with the output datatype from the relay function to ensure the graph is compiled correctly by the support library. Included a test to prevent regression in the future.

* Add Rsqrt to SimplifyExpr * fix unit tests

) Currently, AutoTVM's ApplyHistoryBest class does not support loading tuning logs from memory. This is a pet peeve of mine, as it requires you to work with a tempfile whenever writing autotuning tests. This is also just strange, as the rest of AutoTVM has support for text buffers (e.g. tvm.autotvm.callback.log_to_file supports passing in a text buffer, letting us write to but not read from them).

Additionally, ApplyHistoryBest handles input arguments very unintuitively. Before this PR, it allowed users to pass string filepaths, a list of string filepaths, or an Iterable (such as a list) of input and result tuples. However, it did not support taking in StringIO objects as mentioned above, nor pathlib.Path objects, nor combinations of a filepath and an Iterable of tuples.

In a perfect world, we would change ApplyHistoryBest to take as input a path-like object, file-like object, or an Iterable of input and result tuples (similar to what ApplyGraphBest takes as an argument). However, this would break the existing functionality to take as input a list of filepaths.

To be backwards compatible while fixing this issue, this pull request defines a new type inside dispatcher.py: `Records = Union[Union[str, bytes, Path], TextIOBase, Iterable[Tuple[MeasureInput, MeasureResult]]]`, whose three members cover path-like objects, file-like objects, and already-parsed tuples respectively. It then rewrites ApplyHistoryBest.load so it takes the following arguments: `def load(self, records: Union[Records, Iterable[Records]]):`

This PR also adds unit tests for this new functionality, and fixes a relevant bug in tests/micro/common/test_autotune.py in which a StringIO object was passed to apply_history_best, causing it to appear to pass but not actually read any data.
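
To make the accepted input shapes concrete, here is a minimal sketch of how a loader might sort such arguments into their three kinds. It does not reproduce the actual `ApplyHistoryBest.load` code: `flatten_sources` is a hypothetical helper, and a plain tuple of strings stands in for the `(MeasureInput, MeasureResult)` pair so the example runs without TVM.

```python
from io import StringIO, TextIOBase
from pathlib import Path
from typing import Iterable, Iterator, Tuple, Union

# Stand-in for the (MeasureInput, MeasureResult) pair; an assumption for illustration.
Record = Tuple[str, str]
Records = Union[
    Union[str, bytes, Path],  # path-like objects
    TextIOBase,               # file-like objects
    Iterable[Record],         # already-parsed tuples
]


def flatten_sources(
    records: Union[Records, Iterable[Records]]
) -> Iterator[Tuple[str, object]]:
    """Hypothetical helper: yield a (kind, source) pair for every leaf input."""
    if isinstance(records, (str, bytes, Path)):
        yield ("path", records)
    elif isinstance(records, TextIOBase):
        # Checked before the generic iterable case, because text buffers are iterable.
        yield ("file", records)
    elif isinstance(records, tuple):
        yield ("tuple", records)
    else:
        for item in records:
            yield from flatten_sources(item)


mixed = ["a.log", StringIO("{}"), [("inp", "res")]]
kinds = [kind for kind, _ in flatten_sources(mixed)]
print(kinds)  # ['path', 'file', 'tuple']
```

The ordering of the `isinstance` checks matters: strings and text buffers are themselves iterable, so they have to be recognized before the fallback that recurses into a container.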

See apache#12511 for context. Since more parameterizations are popping up as failed, this disables whole tests rather than specific combinations of parameters.

While the dependencies for microNPU and CMSIS-NN moved into ci_cortexm, Vela is still installed in ci_cpu. As a result, some of the microNPU tests outside of the test_ethosu folder are failing, since they use the presence of Vela to decide whether to skip the test. This change will * remove Vela from ci_cpu * remove unnecessary PATH update

This pr adds support for the special indices values of the reshape operator for the Arm(R) Ethos(TM)-N NPU.

* heap-size is added to project options * change stm32l4r5zi recommended heap size * change stm32l4r5zi recommended heap size * addressing comments * addressing comments * addressing comments Co-authored-by: Mohamad <mkatanbaf@users.noreply.github.com>

… NFC (apache#12562)

…che#12561) This silences the warning
```
warning: 'foo' hides overloaded virtual functions [-Woverloaded-virtual]
```
typically caused by overriding only some overloads of `VisitExpr_` from a set defined in the base class.

* remove deprecated parameters in target * lint * fix cpp tests * remove more configs in test files * address comments * fix error * fix hexagon * fix micro tutorial * fix integration tests * fix hexagon * lint * fix unittest * fix readme * fix assert executor in target * address comments * fix tutorials * fix hexagon target * fix tutorial * fix for tutorials * hexagon

* Fix numerical instability for log sigmoid Fix numerical instability for log sigmoid in pytorch frontend * update * add test for overflow check * merging two tests
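
For readers unfamiliar with the problem, the sketch below shows one common way a log-sigmoid becomes unstable and one standard reformulation that avoids it. It is a generic illustration in plain Python, not the code of the PyTorch frontend fix; both function names are made up for this example.

```python
import math


def log_sigmoid_naive(x: float) -> float:
    # Direct translation of log(1 / (1 + exp(-x))); exp(-x) overflows for very negative x.
    return math.log(1.0 / (1.0 + math.exp(-x)))


def log_sigmoid_stable(x: float) -> float:
    # min(x, 0) - log1p(exp(-|x|)): the exponent is never positive, so exp cannot overflow.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


try:
    log_sigmoid_naive(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed)             # True
print(log_sigmoid_stable(-1000.0))  # -1000.0
```

For moderate inputs the two forms agree; they only diverge in behavior when the naive form's intermediate `exp` leaves the representable range.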

…lue (apache#12558) `compute_cycles` can be the size of an int64 value, however it seems that when that value is attached to the IR as a pragma from Python, it is interpreted as an `int`, rather than `int64_t`. This commit adds an explicit cast to ensure the value is interpreted correctly. The reason these values started appearing very large and randomly is still yet to be solved, although the hope is that this fix will unblock CI. Change-Id: Idcdd7d37af1acd665590c87624446a025b50eb3d
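
A small sketch of why the width matters: when a value that needs more than 32 bits is stored in a 32-bit integer, only the low bits survive. The number below is a made-up example, and `ctypes` is used purely to imitate fixed-width C integers; it is not how TVM attaches the pragma.

```python
import ctypes

# Hypothetical cycle count that needs more than 32 bits.
compute_cycles = 2**40 + 5

as_int32 = ctypes.c_int32(compute_cycles).value  # keeps only the low 32 bits
as_int64 = ctypes.c_int64(compute_cycles).value  # preserves the full value

print(as_int32)  # 5
print(as_int64)  # 1099511627781
```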

Introducing -n auto for CMSIS-NN tests to run them in parallel with pytest-xdist. This is needed because of additional parameterization done over cpu variants. Change-Id: I02e1b37ead0b0a562b5b1b2dacfeb3fdd7cc1ce3

This commit adds support for the `resize` operator for Arm(R) Ethos(TM)-N NPU.

…r compaction (apache#12527) This change makes some minor updates to the region estimator used by buffer compaction:
- Add and clarify the distinction among `EstimateRegionStrictBound`, `EstimateRegionLowerBound` and `EstimateRegionUpperBound`. Originally we only had `EstimateRegionLowerBound`, which in my opinion actually implements strict bound estimation. This change adds the `upper` and `strict` versions for the places where we actually want them.
- When estimating upper bounds (e.g. in buffer compaction), try to estimate each dimension independently when the accesses are dependent and `EstimateRegionLowerBound` is expected to fail. E.g., `A[i, i], 3 < i < 16` fails via `EstimateRegionLowerBound`, which checks that the indices are independent. But we can still make a best effort by invoking strict bound analysis on each dimension individually.
- If range->extent == 1 for `EvalSet(range, dom)`, invoke `EvalSet(range->min, dom)` instead. E.g., `EvalSet([k*k, k*k+1), dom_k)` results in [-inf, +inf] due to a limitation of the current algorithm, but `EvalSet(k*k, dom_k)` results in a range that makes more sense.

This is clean up to use the new `target.features` instead of `IsaAnalyzer`.

…2568) `python.contrib.test_onnx.test_resize` is failing due to a numerical accuracy issue, reported in apache#12567. This patch marks that test as an xfail, so that other tests can be enabled, while this one is investigated separately.

Multiply can be supported when offloaded to the NPU by a conversion to a depthwise convolution operation. This is only supported when the multiply operation has a single variable input with the other being a constant of shape [1, ..., C]. This commit adds a new pass "ConvertEquivalents" (name subject to change) to handle this conversion before codegen.
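
The equivalence behind this conversion can be checked on a toy example: multiplying an `[H, W, C]` tensor by a per-channel constant gives the same result as a depthwise convolution with a 1x1 kernel holding that constant. The sketch below uses nested lists and a hand-written `depthwise_conv2d` (a hypothetical helper, stride 1, no padding); it ignores quantization and everything specific to the NPU integration.

```python
def depthwise_conv2d(x, kernel):
    """x: [H][W][C], kernel: [KH][KW][C]; stride 1, no padding, one filter per channel."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            pixel = []
            for ch in range(c):
                acc = 0
                for di in range(kh):
                    for dj in range(kw):
                        acc += x[i + di][j + dj][ch] * kernel[di][dj][ch]
                pixel.append(acc)
            row.append(pixel)
        out.append(row)
    return out


x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # H=2, W=2, C=2
const = [10, 100]                         # constant of shape [C]

# Reference: broadcast multiply over the channel axis.
expected = [[[v * const[ch] for ch, v in enumerate(px)] for px in row] for row in x]
# Same computation expressed as a depthwise convolution with a 1x1 kernel.
result = depthwise_conv2d(x, [[const]])

print(result == expected)  # True
```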

This PR exposes the following TIR operation in python: - `vectorlow`: tested [here](https://github.com/apache/tvm/blob/592148abf6866a41eefa736efca067d42f5aea86/python/tvm/tir/tensor_intrin/arm_cpu.py#L62) - `vectorhigh`: tested [here](https://github.com/apache/tvm/blob/592148abf6866a41eefa736efca067d42f5aea86/python/tvm/tir/tensor_intrin/arm_cpu.py#L79) - `vectorcombine`: add new unittest Co-Authored-By: yongwww <yongcale@gmail.com>

* working in parallel using worker * creating launchers per test and clean up * clean up * ci change to distribute tests * ci work with any number of devices * fix running on simulator * adding function docstring * fix android_serial_number to always return a list of string * lint issue * fix internal error when skipping tests while android serial number is not set * lint issue

* Add pytest * lint

* [TOPI][Hexagon] Implement quantized avgpool * Fix pylint errors * Needed to adjust input padding for int8 buffer layout * Fix formatting issue * Add unit test for fixed-point conversion utility function Also, address review comments. * Remove pytest.skip for test_avg_pool2d_slice.py to enable on-target testing * Fix formatting issue * Update python/tvm/topi/hexagon/utils.py Co-authored-by: Christian Convey <christian.convey@gmail.com> * Update comments and error messages * Address review comments * Import Tuple from typing * Address pylint error Co-authored-by: Christian Convey <christian.convey@gmail.com>

When you build a project from an existing project directory using `tvm.micro.project.GeneratedProject.from_directory`, it would raise an error if the build directory already existed.

…igned (apache#12519) When compiling TVM with micro support using a compiler that implements char as unsigned (such as arm-linux-gcc), there is an error: `src/runtime/crt/graph_executor/load_json.c:218:12: error: result of comparison of constant -1 with expression of type 'char' is always false [-Werror,-Wtautological-constant-out-of-range-compare]` ` if (ch == EOF || ch == '\r' || ch == '\n') {` The cause is that the signedness of plain char is implementation-defined, so it is better to specify explicitly that it is signed.

This PR exposes the following TIR operation in python: - `shift_left`: tested [here](https://github.com/apache/tvm/blob/1afd0593956066635ee49297b731726c9218c91c/tests/python/unittest/test_tir_transform_simplify.py#L487) - `shift_right`: add new unittest Co-authored-by: yongwww <yongcale@gmail.com>

@Hzfengsy

…zation (apache#12544) cc @Hzfengsy @junrushao @junrushao1994 @masahi @spectrometerHBH

@junrushao

This PR exposes the following TIR operations in Python:
- `tvm_load_matrix_sync`: tested [here](https://github.com/apache/tvm/blob/cd8fd9121deb22b078c9fe73cd8a554e6e7a0e15/tests/python/unittest/test_tvmscript_roundtrip.py#L711)
- `tvm_store_matrix_sync`: tested [here](https://github.com/apache/tvm/blob/cd8fd9121deb22b078c9fe73cd8a554e6e7a0e15/tests/python/unittest/test_tvmscript_roundtrip.py#L913)
- `tvm_mma_sync`: tested [here](https://github.com/apache/tvm/blob/cd8fd9121deb22b078c9fe73cd8a554e6e7a0e15/tests/python/unittest/test_tvmscript_roundtrip.py#L860)
- `tvm_bmma_sync`: add new unittest
- `tvm_fill_fragment`: tested [here](https://github.com/apache/tvm/blob/cd8fd9121deb22b078c9fe73cd8a554e6e7a0e15/tests/python/unittest/test_tvmscript_roundtrip.py#L571)

Co-authored-by: yongwww <yongcale@gmail.com>
cc: @junrushao @Hzfengsy @junrushao1994

@masahi

This PR adds `aten::new_empty`, which is used by models such as `hf_Longformer`. cc: @masahi

@mehrdadh

Needed for apache#12587 @mehrdadh cc @Mousius @areusch @driazati @gigiblender

…ache#12585) This PR sets recommended heap size for qemu_x86 and NRF board to fix memory size with models like VWW using AoT host driven executor.

This PR adds a script that does a diff of skipped tests between the latest successful build on the main and the current branch. Then, it posts a comment with the report on the open PR. apache#11670

The code block in the Debugging TVM part is not showing up. This fixes it.

…e processed (apache#12578) In a recent change, `github.post` throws `RuntimeError` instead of `HTTPError` when the requested reviewer isn't a project collaborator. This prevents other reviewers from being added to the PR, for example, https://github.com/apache/tvm/runs/8001367110?check_suite_focus=true. This PR changes the caller to catch any exception so the execution won't be interrupted. Co-authored-by: driazati <9407960+driazati@users.noreply.github.com>
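
The pattern described above, catching any exception per reviewer so that one rejection does not stop the rest, might look roughly like the sketch below. `add_reviewers` and `fake_post` are hypothetical stand-ins; the real script's function names and API calls are not shown in this message.

```python
from typing import Callable, Dict, List


def add_reviewers(reviewers: List[str], post: Callable[[str], None]) -> Dict[str, str]:
    """Hypothetical caller: request each reviewer, recording failures instead of aborting."""
    failures: Dict[str, str] = {}
    for reviewer in reviewers:
        try:
            post(reviewer)
        except Exception as exc:  # broad on purpose: the raised type changed recently
            failures[reviewer] = str(exc)
    return failures


added: List[str] = []


def fake_post(reviewer: str) -> None:
    # Stand-in for the real API call; rejects one non-collaborator.
    if reviewer == "outsider":
        raise RuntimeError("not a collaborator")
    added.append(reviewer)


failures = add_reviewers(["alice", "outsider", "bob"], fake_post)
print(added)     # ['alice', 'bob']
print(failures)  # {'outsider': 'not a collaborator'}
```

Collecting the failures in a dictionary keeps the information available for logging while still letting the remaining reviewers be processed.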

…2570) Removes tests previously marked as xfail since the issue has now been resolved.

In similar fashion to the conversion of mul to depthwise, this commit converts add when one input is a constant of shape [1, ..., n] to a depthwise convolution. If neither input is a constant, the add is offloaded naturally like before. The addition testing has been improved to use pytest features.

@sfvaroglu

…cales differ (apache#12577) cc @sfvaroglu @AndrewZhaoLuo

* Fix cuda codegen's fp16 inf literal * add relay testcase

* Revert "[skip ci] Revert "[ci] Default to n=2 for test parallelism (apache#12376)" (apache#12413)" This reverts commit 478b672. * [ci] Default to n=2 for test parallelism This is attempt #2 of apache#12376 which was reverted in apache#12413. The changes in `plugin.py` should keep all the tests on the same node so sporadic failures don't happen due to scheduling. Co-authored-by: driazati <driazati@users.noreply.github.com>

* Change default alignment to 64 bits. * Run dlpack test a few times. * Update alignment in tests. * Revert mma alignment change. * Change default printing of buffer. * Change crt runtime default allocation.

[Community] Wuwei Lin -> PMC

… with Relay (apache#12596) * Fix empty axis of `squeeze` in TOPI. * Add test case for `squeeze` with empty `axis`. * Add LLVM target for `test_squeeze`.

* Expose Memory Copy-Related PTX Builtins This PR exposes the following TIR operation in python: `ptx_ldmatrix`: tested `ptx_cp_async`: tested `ptx_commit_group`: tested `ptx_wait_group`: tested Co-authored-by: yongwww <yongcale@gmail.com> * apply code review suggestion Co-authored-by: yongwww <yongcale@gmail.com>

…o choose possible position (apache#12450) The current TIR `compute_at` primitive places a block at its closest consumer. When a block has multiple producers, whichever producer is computed-at later ends up behind the one moved earlier. For some special hardware, however, we want to keep a fixed order regardless of whether a producer is computed-at early or late. For example, suppose block A and block B are both producers of block C, block A is computed at block C first, and block B is computed at block C later. We want the result to be block B->block A->block C under some loop var.

* add simplify for dq->arg funcs * add comments, fix lint * move comments to the right spots

Enables AutoTVM-style, template-based tuning for Hexagon.

To run compiled code on Hexagon, we need to use the Hexagon `Session` object https://github.com/apache/tvm/blob/dc522a6ff65b68532cd1bba43827cd981114df2c/python/tvm/contrib/hexagon/session.py#L35 in the metaschedule `RPCRunner`. But for an RPC "session", `RPCRunner` expects an instance of `RPCSession`, https://github.com/apache/tvm/blob/53fe5966823eee4e011d7228bceab3c82c1d9caa/python/tvm/rpc/client.py#L32, to be created and used by various customizable functions. Since `RPCSession` and the Hexagon `Session` have slightly different APIs, we cannot use `RPCRunner` with customizable functions directly. So I introduced an alternative implementation of `RPCRunner` for Hexagon.

The test is disabled for the simulator since `HexagonLauncherSimulator` is not pickle-able due to its `multiprocessing.Process` attribute: https://github.com/apache/tvm/blob/c97895e0ffb512e73c89de7cdee9846f052244fc/python/tvm/contrib/hexagon/build.py#L614

Output log from tuning `vrmpy` dense (included in the test):
```
 ID | Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated
--------------------------------------------------------------------------------------------------------------
  0 | main | 150994944 |      1 |       380.3399 |     397.0000 |              397.0000 |     32 |
--------------------------------------------------------------------------------------------------------------
```

Previously, the `TVM_SREF_TO_BLOCK`, `TVM_SREF_TO_FOR`, and `TVM_TYPE_AS` macros required both the input and output variables. The input variable name is useful for improving the error message returned, but the output variable name isn't necessary for this functionality, and prevents the macro from being used as part of an expression. * Generate an immediately-invoked lambda expression to allow for an independently-scoped `result` variable. * Use parentheses around the input argument, in case the sref is the result of an expression. * Update all call sites to remove the macro argument providing the first argument.

The new image has xgboost installed, which I need for apache#12587 Validated in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/279/pipeline

The problem with greedy lexing of >> as an operator was solved in C++11, and now templates no longer require spaces between >'s.

Adds asynchronous DMA support through the Hexagon User DMA engine with unit tests to validate basic functionality. Asynchronous DMA support here means the ability to "kick off" asynchronously a number of DMAs using the Copy API and then to Poll for or Wait on a number of "in flight" (not done) DMAs. Enables future testing and development for asynchronous memory copy on Hexagon. For now, Hexagon DMA support remains synchronous in nature through the existing hexagon_user_dma_1d_sync interface, which uses the asynchronous-capable HexagonUserDMA class in a synchronous way --- calling Copy and Wait back to back for each request. * use ring buffer to store DMA descriptors * add RingBuffer class; used by HexUserDMA to store descriptors * add test to overflow the HexagonUserDMA ring buffer

`ApplyHistoryBest` right now plays a role as the database adaptor to query inside the database. In fact, the logic could be simplified and users only have to deal with `Database` instead of this extra object. - [x] Add `EnterWithScope`/`ExitWithScope`/`Current` to Database - [x] Migrate `te_filter_func` => "tir_filter" in Relay's pass context - [x] Migrate `f_take_tuning_record` => "Database.query_tuning_record" - [x] Migrate `TECompiler` to use `Database` - [x] Remove apply-history-best Next PR: - Migrate `f_direct_dispatch` (potentially unify with `apply_fixed_schedule`?)

Expose MMA-related PTX builtins

This PR exposes the following TIR operations in Python:
- `ptx_mma`: tested
- `ptx_mma_sp`: tested
- `mma_store`: add new unittest
- `mma_fill`: add new unittest

Co-authored-by: yongwww <yongcale@gmail.com>

Following apache#12520, this PR introduces `ScheduleFnDatabase`, a mocked database to allow injecting handcrafted schedules provided by a schedule function. The schedule function comes with the following signature:
```python
def schedule_fn(
    sch: tir.Schedule,
) -> bool:
    task_name = sch.mod.attrs["task_name"]
    # ^^^ provides an optional name of the task queried
    ...
```
This mocked database helps incorporate the existing testing utility `apply_fixed_schedule` more formally into the MetaSchedule-Relay build pipeline, and allows further extension to Relax with the same interface. Next, as another follow-up, we will introduce ConcatDatabase, which allows mixing multiple databases, including the mocked ones and ones from JSON files.

* [Refactor] Replace std::tie with structured bindings With C++17 enabled in apache#12337, using structured bindings to replace cases where `std::tie` is used to define local variables. * Added missing header for <optional> * Silenced unused variable warnings after structured bindings This is a bug in gcc version 7, resolved in gcc 8. While gcc version 7 is used for CI, we'll need to silence unused variable warnings resulting from using only part of a structured binding. More information: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81767

* [QNN] Align output_scale/zero_point of sigmoid to Torch

…pache#12620) add xfail

The timestamp in the Jenkinsfile is there to prevent post-merge conflicts from different PRs that edit the templates merging non-sequentially. This is not an issue when a line is edited in place though, which is often the case when Docker image tags are updated. This PR makes it so the timestamp is not updated in these cases which should reduce merge conflicts on these types of PRs.

Previously, `Callable` was handled as an atomic type. This worked when it was included as last element of a `Union[]` annotation with no subtypes, but raised an error for other use cases, including `Optional[Callable]`. This commit adds explicit checks for `Callable` type annotations to validate whether the argument is callable, but doesn't recursively validate the signature of the callable object, because lambda functions cannot have type annotations. (https://peps.python.org/pep-3107/#lambda)
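
As a rough illustration of the behavior described, the sketch below validates a value against `Callable`, `Optional[Callable]`, and plain class annotations using only the `typing` module. `check` and `is_callable_annotation` are hypothetical helpers written for this example and are not the project's type-checking code; like the commit, the sketch only tests that the value is callable and does not inspect its signature.

```python
import collections.abc
import typing


def is_callable_annotation(tp) -> bool:
    # Covers bare typing.Callable, collections.abc.Callable, and Callable[[...], ...].
    return (
        tp is typing.Callable
        or tp is collections.abc.Callable
        or typing.get_origin(tp) is collections.abc.Callable
    )


def check(value, annotation) -> bool:
    """Hypothetical checker covering Callable, Union/Optional, and plain classes."""
    if is_callable_annotation(annotation):
        return callable(value)  # signature is deliberately not validated
    if typing.get_origin(annotation) is typing.Union:
        return any(check(value, arg) for arg in typing.get_args(annotation))
    return isinstance(value, annotation)


print(check(lambda x: x, typing.Optional[typing.Callable]))  # True
print(check(None, typing.Optional[typing.Callable]))         # True
print(check(3, typing.Optional[typing.Callable]))            # False
```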

…#12638) Previously, type-checks in boolean operators on `PrimExpr` would state that the type is incorrect, but further investigation would be required in order to determine what expression caused the error. After this commit, error messages for these type checks include the expression that was used, and the dtype of that expression.

[CI] Update Hexagon image to install boost (apache#12613) The new image has xgboost installed, which I need for apache#12587 Validated in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/279/pipeline Co-authored-by: masahi <masahi129@gmail.com>

…rinter (apache#12618) * support float inf, -inf and nan in TVMScript parser and printer * address comment and fix lint * use type_extensions.Literal * address comments * fix win build * remove template

When I built a keyword spotting ONNX model, there was an issue with the pool schedule because certain schedules like broadcast and elemwise do not have input tensors.

* Fix test utils * Update python/tvm/micro/testing/utils.py Co-authored-by: driazati <9407960+driazati@users.noreply.github.com>

Expose Hexagon gtest output in CI by raising it as a runtime exception rather than printing it to stdout.

…pache#12642) This PR adds missing CMSIS-NN source files to Zephyr cmake template file for models like keyword spotting, anomaly detection, VWW and image classification.

This makes it so changes to certain files from users not listed in `CONTRIBUTING.md` are not tested in CI. This is necessary since these scripts run on the baremetal EC2 instances and not inside Docker containers, so they can affect other builds and potentially grab Jenkins secrets. This checks out the version from the upstream for the listed files after running `git checkout`. Tested in CI: [positive](https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-12604/6/pipeline/) and [negative](https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-12604/9/pipeline/)

…e#12648) * Complete winograd scheduling. * Fix test.

…12594) Fixes the case when reshape is > 4 dims. While this cannot be offloaded to the NPU, the check was previously producing an error preventing further compilation. The correct behavior is to ensure the check returns False and not offload the reshape.

…ffer var (apache#12412) * Updated TVMScript syntax of `T.allocate` to return buffer var. * Added syntax sugar for `T.decl_buffer`. When `data` field is not specified, `data` will be implicitly created via `Allocate` stmt. * Updated the existing test cases. Most test cases can be updated by changing `T.allocate` to `T.decl_buffer`. `T.allocate` in some tests are updated to `T.allocate` + `T.buffer_decl`, to maintain the legacy behavior of allocation and implicit buffer declaration (will be followed up in future PR to adopt `T.decl_buffer`).

…pache#12660) This patch causes test_load_model___wrong_language__to_pytorch to be skipped on AArch64 due to a bug that can be reproduced when enabling integration tests on machines with Torch installed alongside TVM. The error message seen is:
```
OSError: /usr/local/lib/python3.7/dist-packages/torch/lib/libgomp-d22c30c5.so.1: cannot allocate memory in static TLS block
```
While the test needs further investigation, it is being marked as skipped so that other tests can be enabled without regressing, and to allow time for the investigation. This relates to the issue described in apache#10673.

* [skip ci][ci] Fix Jenkinsfile (apache#12387) This got out of date after merging apache#12178 Co-authored-by: driazati <driazati@users.noreply.github.com> * Address comments Co-authored-by: driazati <driazati@users.noreply.github.com>

…che#12661) Previously, the argument needed to be an integer specifying the index into the read/write regions of a block. Now, the argument can be a string specifying the name of the buffer, or the Buffer object itself. This is a follow-up from apache#11624.
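The three accepted argument forms can be illustrated with a small stand-alone resolver. This is only a sketch: `Buffer` and `resolve_buffer_index` are hypothetical names made up for illustration and are not the TVM implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Buffer:
    """Hypothetical stand-in for a TIR buffer; only the name matters here."""

    name: str


def resolve_buffer_index(buffers: List[Buffer], arg: Union[int, str, Buffer]) -> int:
    """Return the position in `buffers` selected by an index, a name, or a Buffer."""
    if isinstance(arg, bool):
        raise TypeError("bool is not a valid buffer selector")
    if isinstance(arg, int):
        # Legacy form: an integer index into the read/write regions.
        if not 0 <= arg < len(buffers):
            raise IndexError(f"buffer index {arg} out of range")
        return arg
    if isinstance(arg, str):
        # New form: look the buffer up by name.
        matches = [i for i, buf in enumerate(buffers) if buf.name == arg]
    elif isinstance(arg, Buffer):
        # New form: look the buffer up by the object itself.
        matches = [i for i, buf in enumerate(buffers) if buf == arg]
    else:
        raise TypeError(f"unsupported buffer selector: {type(arg).__name__}")
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {arg!r}, found {len(matches)}")
    return matches[0]


regions = [Buffer("A"), Buffer("B")]
print(resolve_buffer_index(regions, 1))
print(resolve_buffer_index(regions, "B"))
print(resolve_buffer_index(regions, Buffer("A")))
```

Rejecting ambiguous names instead of silently picking the first match is a design choice of this sketch, not something the commit message states.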

This pr fixes pylint errors in tests/python/contrib/test_ethosn as reported in issue apache#11414.

[Relay] Extract Intermediate Expr by relay expr ID for analysis modify doc comments Co-authored-by: Bin Li <binli1@amd.com>

…he#12659) This commit adds a high-performance implementation of the fixed_point_multiply operation based on Hexagon intrinsics for the vmpye/vmpyo instructions. Benchmarking of the 'fixed_point_multiply' op with a (1,8,56,56,32) input tensor on Qualcomm SM8350:
* default implementation: 10.06 ms
* optimized implementation: 1.42 ms
* speedup: ~7x

Please note that this introduces a small rounding error for some corner cases with a negative shift argument (the same as for ARM CPU, see PR#5980). This is because we are rounding twice rather than only once:
* original q_multiply_shift: round(x*y*2^-s)
* hexagon q_multiply_shift: round(round(x*y)*2^-s)
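The effect of rounding twice can be reproduced with exact rational arithmetic. The sketch below is not the Hexagon kernel; it assumes a round-half-up rounding mode and treats `product` as the exact value of x*y, both of which are assumptions made for illustration.

```python
from fractions import Fraction
from math import floor


def round_half_up(value: Fraction) -> int:
    """Round to the nearest integer, ties going up (assumed rounding mode)."""
    return floor(value + Fraction(1, 2))


def single_rounding(product: Fraction, shift: int) -> int:
    """round(x*y*2^-s): one rounding step at the very end."""
    return round_half_up(product / 2**shift)


def double_rounding(product: Fraction, shift: int) -> int:
    """round(round(x*y)*2^-s): round the product first, then round again after scaling."""
    return round_half_up(Fraction(round_half_up(product)) / 2**shift)


product = Fraction(5, 2)  # hypothetical exact value of x*y
shift = 1
print(single_rounding(product, shift))  # 1
print(double_rounding(product, shift))  # 2
```

With these inputs the first rounding turns 2.5 into 3, and the second turns 1.5 into 2, whereas a single rounding of 1.25 gives 1. Most inputs agree between the two variants; only values near a tie are affected.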

…#12668) fix autoinline and add test

* Add methods to read and restore late-bound constants on Executable. * Add bindings for new functions * Cleanup * Fix function name * Add tests for python API to access new load/save functions * Add another tests for python API to access new load/save functions where there are no constants

) * [Adreno] Change compute/schedule for ToMixedPrecision pass * Address CI fails * address PR comments * Fix AutoTVM flow

- Re-enable test_max_pool2d_slice.py when run on Hexagon hardware (as opposed to hexagon-sim). This is now safe because apache#11928 has been fixed.

…he#12628) Following up apache#12520 and apache#12626, this PR introduces two database classes: `UnionDatabase` and `OrderedUnionDatabase`, both of which allow users to compose multiple databases, so that the high-level IR (Relay, Relax) can select the best tuning records according to running time or a preferred order given by users. For each query, `UnionDatabase` returns the best record among all the given databases, whereas `OrderedUnionDatabase` returns the record from the first database that responds to the query. Used together, they let users specify complicated dispatching patterns. The examples below demonstrate the use cases of, and the difference between, `UnionDatabase` and `OrderedUnionDatabase`.

Assumptions:
* db1, db2 do not have tuning records for the target workload.
* db3, db4, db5 have tuning records r3, r4, r5 for the target workload, respectively.

```python
### Case 1. `UnionDatabase`
merged_db = ms.database.UnionDatabase(
    db1,  # no record
    db2,  # no record
    db3,  # has r3
    db4   # has r4
)
# returns the better one between r3 and r4
merged_db.query_tuning_record(..., target_workload)

### Case 2. `OrderedUnionDatabase`
merged_db = ms.database.OrderedUnionDatabase(
    db1,  # no record
    db2,  # no record
    db3,  # has r3
    db4   # has r4
)
# returns r3
merged_db.query_tuning_record(..., target_workload)

### Case 3. Mix-use scenario
merged_db = ms.database.UnionDatabase(
    db1,  # no record
    db2,  # no record
    db3,  # has r3
    ms.database.OrderedUnionDatabase(  # returns r4
        db4,  # has r4
        db5,  # has r5
    )
)
# returns the better one between r3 and r4
merged_db.query_tuning_record(..., target_workload)

### Case 4. Another mix-use scenario
merged_db = ms.database.UnionDatabase(
    db1,  # no record
    db2,  # no record
    db3,  # has r3
    ms.database.UnionDatabase(  # returns the better one between r4 and r5
        db4,  # has r4
        db5,  # has r5
    )
)
# returns the best one among r3, r4 and r5
merged_db.query_tuning_record(..., target_workload)

### Case 5. Yet another mix-use scenario
merged_db = ms.database.OrderedUnionDatabase(
    db1,  # no record
    db2,  # no record
    ms.database.UnionDatabase(  # returns the better one between r3 and r4
        db3,  # has r3
        db4,  # has r4
    ),
    db5,  # has r5
)
# returns the better one between r3 and r4
merged_db.query_tuning_record(..., target_workload)
```
Co-authored-by: sunggg <49998730+sunggg@users.noreply.github.com>
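The two composition rules can also be modeled without TVM. The following is a toy model under stated assumptions: the `Toy*` classes, the `query` method, and the run times are all hypothetical and only mimic the semantics described above, with "best" taken to mean the lowest run time.

```python
from typing import Dict, Optional, Tuple

Record = Tuple[str, float]  # (record name, hypothetical run time in ms)


class ToyDatabase:
    """A leaf database holding at most one record per workload."""

    def __init__(self, records: Dict[str, Record]):
        self._records = records

    def query(self, workload: str) -> Optional[Record]:
        return self._records.get(workload)


class ToyUnionDatabase:
    """Returns the fastest record among all member databases."""

    def __init__(self, *databases):
        self._databases = databases

    def query(self, workload: str) -> Optional[Record]:
        found = [r for r in (db.query(workload) for db in self._databases) if r is not None]
        return min(found, key=lambda record: record[1]) if found else None


class ToyOrderedUnionDatabase:
    """Returns the record of the first member database that has one."""

    def __init__(self, *databases):
        self._databases = databases

    def query(self, workload: str) -> Optional[Record]:
        for db in self._databases:
            record = db.query(workload)
            if record is not None:
                return record
        return None


db1 = ToyDatabase({})
db3 = ToyDatabase({"w": ("r3", 2.0)})
db4 = ToyDatabase({"w": ("r4", 1.0)})
db5 = ToyDatabase({"w": ("r5", 0.5)})

print(ToyUnionDatabase(db1, db3, db4).query("w"))
print(ToyOrderedUnionDatabase(db1, db3, db4).query("w"))
print(ToyOrderedUnionDatabase(db1, ToyUnionDatabase(db3, db4), db5).query("w"))
```

Because every class exposes the same `query` method, unions can be nested freely, which is what makes the mixed scenarios in Cases 3 to 5 possible.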

@cyx-6

Please join me in welcoming Yaxing Cai (@cyx-6) as a new reviewer in TVM. Yaxing has brought the PackedFunc into TVM object system ([RFC-051](apache/tvm-rfcs#51)), designed and implemented the new parser infrastructure for TVMScript and meta-programming ([RFC-079](apache/tvm-rfcs#79)) - [Commits History](https://github.com/apache/tvm/commits?author=cyx-6) - [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Acyx-6+)

fix arange for pytorch nightly 20220815

This PR introduces a set of `.create` methods making it easier to create MetaSchedule objects. For example:
```python
ms.database.JSONDatabase(...)
ms.database.create("json")
ms.runner.RPCRunner(...)
ms.runner.create("rpc")
```
Besides, this PR allows `JSONDatabase` to be created via `work_dir`:
```python
db = ms.database.create("json", work_dir="/path/to/db/")
db = ms.database.create(work_dir="/path/to/db/")  # or even simpler
```

Fixing a few more pylint issues caught when using pylint==2.9.3. Change-Id: Ie7ca61e1a8083a40e0ffccf1418192966884707a

Supports offloading concatenate with a negative axis to the NPU. In addition, parameterized the concatenate unit tests.

This fixes the issue where merging from GitHub Actions (i.e. with the default `GITHUB_TOKEN`) doesn't trigger post-merge GitHub Actions on the commit it creates in `main`. Instead, these jobs are triggered manually by a call to the Actions API after the merge has taken place. This also updates the tvmbot testing code (and by extension some of the other CI testing code) to remove the fixtures for each test in favor of constructing them from a single sample at runtime. This makes it a lot easier to add new tests and see what is different between each data sample, and cleans up the testing anti-patterns that were there before (e.g. `run()` instead of `pytest.mark.parametrize`; none of the tests in `test_ci.py` have changed). Tested in driazati#36 which ran https://github.com/driazati/tvm/actions/runs/2881047903

Example:
```bash
python -m tvm.autotvm.testing.tune_relay \
    --workload bert_base \
    --input-shape '[1,64]' \
    --target "llvm" \
    --num-trials 800 \
    --rpc-host 192.168.6.66 \
    --rpc-port 4445 \
    --rpc-key 3090ti \
    --work-dir /logs/autotvm-bert_base \
    --cache-dir /cache-workloads \
    --graph-tuner True \
    --cpu-flush True \
    --backend graph
```

This adds some checks for the current usages of the PR linter and fixes the case where the script would error uncleanly when a PR body was `null`.

There are now many warnings in the tuning process about undefined memory information when using textures. A definition is required as textures* are tagged.

As a follow-up to apache#12337, updating the EMCC flags from `-std=c++14` to `-std=c++17`.

) Using pytest parameterization helps identify the particular parameter combinations that are failing for a given test. Additionally, it can be useful when parallelizing the tests. This commit makes sure that "trials" have been replaced by parameterization as well as completing a general cleanup.

…#12710) At the moment, android camera installs the latest TF and Keras, which causes the following issue in CI:
```
File ".../keras/dtensor/lazy_variable.py", line 26, in <module>
    from tensorflow.python.trackable import base as trackable
ModuleNotFoundError: No module named 'tensorflow.python.trackable'
```
This patch pins both to their last known working versions: TF 2.9.1 and Keras 2.9.

- Introduce 'global.ddr' memory scope:
  - Like 'global', this allocates memory from the Hexagon SoC's DDR memory.
  - Like 'global.vtcm', the specified tensor shape must be 1d or 2d, where 2d indicates Hexagon's "indirect tensor" (i.e., discontiguous) allocation scheme.
- Change memory-alignment strategy to always be 2048-byte aligned on Hexagon. (This can be refined in the future, but for now it ensures all allocations meet the strictest alignment requirements for any Hexagon operations.)

…pache#12655) * [TIR][StorageRewrite] Allow in-place buffer reuse of non-flat memory Previously, shared buffer use was entirely disabled for non-flat memory, since the existing checks for shared memory assume flat 1-d spaces. This was enforced in `FindAlloc` and validated in `PrepareNewAlloc`. The validation in `PrepareNewAlloc` could trigger, if the buffer sharing was due to an in-place operation, and not through the `FindAlloc` function. In-place operations do not require N-d packing, nor do they introduce ambiguity in how different code generators may interpret non-flat physical indices. Therefore, this commit relaxes the validation in `PrepareNewAlloc`, allowing buffer reuse of non-flat buffers for in-place operations. * Update new StorageRewrite with correct allocate/buffer_decl usage

Motivation: In quantized models, the nn.pad operation is typically not fused with QNN ops and lives as a standalone operation. In this case it uses the default injective schedule for the Hexagon target, which is not optimized very well (based on analysis of real models like ResNet50 INT8).

What was done: A new schedule for the Pad operation was implemented in place of the default injective schedule. For the Hexagon target, the injective schedule fuses all axes and vectorizes on 128/64/32 (depending on dtype). This works fine for Add, Sub, etc., but not for Pad. The new optimized schedule performs these steps (fusion + vectorization) only if the last tensor dimension is divisible by 128/64/32 (depending on dtype). This was done only for Hexagon; for other targets (x86, cuda, etc.) there are no changes and the default injective schedule is used.

Benchmark results on Snapdragon 888:

4d NHWC layout with ((0, 0), (1, 1), (1, 1), (0, 0)) padding, "uint8" dtype:

| shape | default schedule, ms | optimized schedule, ms | speedup |
|-------------------|----------------------|------------------------|------------------|
| (1, 112, 112, 32) | 10.03 | 0.2 | 50.1x |
| (1, 56, 56, 128) | 0.099 | 0.085 | ~1x (no speedup) |

4d NCHW layout with ((0, 0), (0, 0), (1, 1), (1, 1)) padding, "uint8" dtype:

| shape | default schedule, ms | optimized schedule, ms | speedup |
|-------------------|----------------------|------------------------|------------------|
| (1, 128, 56, 56) | 10.96 | 1.38 | 7.9x |
| (1, 32, 126, 126) | 1.66 | 1.58 | ~1x (no speedup) |
| (1, 32, 128, 128) | 13.98 | 2.66 | 5.25x |

5d NCHWc layout with ((0, 0), (0, 0), (1, 1), (1, 1), (0, 0)) padding, "uint8" dtype:

| shape | default schedule, ms | optimized schedule, ms | speedup |
|--------------------|----------------------|------------------------|------------------|
| (1, 4, 56, 56, 32) | 6.39 | 0.29 | 22x |
| (1, 56, 56, 128) | 0.15 | 0.15 | ~1x (no speedup) |

Summary: For some input tensors we get up to a 50x speedup; for others the performance is the same. No performance degradations were detected.
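The divisibility rule can be sketched as a small predicate. This is an illustration only, not the schedule code: the function names are hypothetical, the 128/64/32 widths are mapped to 1-, 2-, and 4-byte dtypes as an assumption, and the check is assumed to apply to the padded (output) shape.

```python
from typing import Sequence, Tuple

# Assumed mapping from dtype to the vectorization factor quoted above (128/64/32).
VECTOR_LANES = {"uint8": 128, "int8": 128, "float16": 64, "float32": 32}


def padded_shape(shape: Sequence[int], pad_width: Sequence[Tuple[int, int]]) -> Tuple[int, ...]:
    """Shape of the tensor after padding each axis by (before, after)."""
    return tuple(dim + before + after for dim, (before, after) in zip(shape, pad_width))


def uses_fusion_and_vectorization(shape, pad_width, dtype: str) -> bool:
    """True if the padded tensor's last dimension is a multiple of the vector width."""
    return padded_shape(shape, pad_width)[-1] % VECTOR_LANES[dtype] == 0


nhwc_pad = ((0, 0), (1, 1), (1, 1), (0, 0))
nchw_pad = ((0, 0), (0, 0), (1, 1), (1, 1))

print(uses_fusion_and_vectorization((1, 112, 112, 32), nhwc_pad, "uint8"))
print(uses_fusion_and_vectorization((1, 56, 56, 128), nhwc_pad, "uint8"))
print(uses_fusion_and_vectorization((1, 32, 126, 126), nchw_pad, "uint8"))
```

Under these assumptions the predicate lines up with the 4d rows of the benchmark tables: the shapes for which it returns True are the ones reported with no speedup, which is what one would expect if they keep the fusion and vectorization steps of the default schedule.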

* [TVMC] Run module once by default

Currently, executing `tvmc run module.tar` runs the input model twice. For benchmarking this is to be expected, as the first run is used to prime caches etc. before taking a measurement. However, it is unintuitive to have this as the default, especially when benchmarking is not always intended. This commit therefore changes the default `tvmc run module.tar` to a single run.

After inspection, the double run comes from the use of the `.benchmark()` method, which runs (1 + repeat * number) executions in total. This means that at least two runs are required (i.e. when repeat=1, number=1). From the CLI point of view, it is only necessary to benchmark the model when `--print-time` has been set. From the Python interface point of view, benchmarking is always run, but this may not always be necessary.

This commit uses the `.run()` method to execute the model once by default. From the CLI this is used when `--print-time` is set to False, whereas from the Python interface it is used when `benchmark=False`. Otherwise, the `.benchmark()` method is used as before. Complementary to this change, the `repeat`, `number` and `end_to_end` parameters are only used when either `--print-time` or `benchmark` is set to True, and the documentation has been updated to indicate this.

Change-Id: I18a38a9d430d660264f7fce5caf0779aa059fed3

* improve documentation with number of executions when benchmarking

Change-Id: Iecf557594420fcc9f3abcec5ce7d952db2c94271
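The execution counts quoted above follow from simple arithmetic. The helpers below are hypothetical and do not call TVMC; they only restate the (1 + repeat * number) formula and the single-run default.

```python
def benchmark_executions(repeat: int, number: int) -> int:
    """Total model executions when benchmarking: one warm-up run plus repeat * number."""
    return 1 + repeat * number


def executions_for_run(benchmark: bool, repeat: int = 1, number: int = 1) -> int:
    """Executions for one invocation: a single run unless benchmarking is requested."""
    return benchmark_executions(repeat, number) if benchmark else 1


print(executions_for_run(benchmark=False))                        # 1
print(executions_for_run(benchmark=True))                         # 2
print(executions_for_run(benchmark=True, repeat=3, number=10))    # 31
```

Here `benchmark=True` stands in for either `--print-time` on the CLI or `benchmark=True` in the Python interface.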

This commit adds the Commit Message Guideline text to Apache TVM documentation in ./docs/contribute/pull_request.rst, under section 'Submit a Pull Request', below subsection 'Guidelines', as a subsection named “Commit Message Guideline”. The text in the second-last item in subsection 'Guidelines' that mentions PR tags is also updated to refer to this guideline. This documentation will help guide contributors on how to write good commit messages when submitting code / creating Pull Requests, in accordance with RFC-0088: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0088-commit-message-guideline.md

…pache#12699) Currently, LoopPartition doesn't check the value of the attribute key "pragma_loop_partition_hint". Whether pragma_loop_partition_hint is set to True or False, the result is the same, which is confusing when debugging. This PR makes LoopPartition check the value of the pragma_loop_partition_hint attribute key.

Adds support for offloading transpose convolution with an optional bias to the NPU. Co-authored-by: Samuel Panijel <samuel.panijel@arm.com> Co-authored-by: Leo Blonk <leo.blonk@arm.com>

…e#12718) * add speed optimization flag * trigger * add exception for qemu_riscv64

dequantize op hexagon

Follow-up from apache#12337 and apache#12693, updating a few additional locations that specified C++14.

* IRBuilder methods for `IRModule` This PR introduces IRBuilder methods for `IRModule`. Co-authored-by: yongwww <yongcale@gmail.com> * apply code review suggestion Co-authored-by: yongwww <yongcale@gmail.com>

This updates the TF version to be used in TVM CI to 2.9.1, which brings improvements so that more platforms are supported by official packages. When building TFLite, an update to CMake was also required, which is updated now to 3.18.4. ethos-u-vela dependency is also updated, from version 3.2.0 to 3.4.0 so that it is closer to the TensorFlow version being proposed here. This PR updates the Docker images scripting to install TF and TFLite. Change-Id: I290085f0c018ad57606f1295494c19ff6e1af2dd

Addresses this CI failure on `main`: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/main/4235/pipeline/ Co-authored-by: driazati <driazati@users.noreply.github.com>

Replace '> >' in templates with >>, NFC (apache#12615) The problem with greedy lexing of >> as an operator was solved in C++11, and now templates no longer require spaces between >'s. Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>

…titionConfig to unroll loop (apache#12631) [TIR] Add unroll_loop_with_partition_hint_no_interval attr in LoopPartitionConfig to unroll loop

apache#12711) * [OpenCLML] CLML Profiling fixes corresponding to OpenCL Timer recent changes. * [OpenCLML] Review comments. * * review comment

…ache#12671) Currently, Relay QNN uses its `helper_no_fast_int8_hw_legalization` to convert most `int8` convolution and dense operations into `int16` ones on Arm. This currently occurs on ARM chips except for `v8.2a` chips with `dotprod` support. However, this behavior means that `int8` operations are replaced with `int16` ones on Cortex-M chips. On these chips `int16` is substantially slower, as while it saves a few sign extension operations, it doubles the amount of memory loads we need to perform. This PR changes when `helper_no_fast_int8_hw_legalization` is used on Arm, and instead makes **not** doing this replacement the standard. We will only do this replacement if we are on a chip with ASIMD support but without `v8.2a` and `dotprod`. This ensures that Cortex-M microcontrollers do not have `int8` operations turned into `int16` ones. I have also verified that this does, in fact, improve performance for some common models. For example, MobileNet_v1_0.25 on the Cortex-M4 saw a 10% performance improvement, compared to before this change. Accuracy does not seem to be affected.
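The new condition can be written as a small predicate. This is one reading of the description above, not the Relay QNN code: the function name and the three boolean feature flags are hypothetical.

```python
def converts_int8_to_int16(has_asimd: bool, has_v8_2a: bool, has_dotprod: bool) -> bool:
    """Whether int8 conv/dense would be legalized to int16 under the new rule.

    Assumption: "fast int8" means v8.2a together with dotprod, as described above.
    """
    has_fast_int8 = has_v8_2a and has_dotprod
    return has_asimd and not has_fast_int8


# Cortex-M class core: no ASIMD, so int8 operations are left alone.
print(converts_int8_to_int16(has_asimd=False, has_v8_2a=False, has_dotprod=False))
# ASIMD core without dotprod: still converted to int16.
print(converts_int8_to_int16(has_asimd=True, has_v8_2a=False, has_dotprod=False))
# v8.2a core with dotprod: int8 is kept.
print(converts_int8_to_int16(has_asimd=True, has_v8_2a=True, has_dotprod=True))
```

The key change is that the absence of ASIMD now means "do not convert", whereas before any Arm chip without v8.2a and dotprod was converted.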

Some integration tests are not being run on CI due to the configuration of the machine with onnx and torch not calling the integration tests script. This patch skips two more tests failing with the error message below: ``` "OSError: /.../torch/lib/libgomp-d22c30c5.so.1: cannot allocate memory in static TLS block" ```

This patch marks two tests as xfail for further investigation: * test_meta_schedule_integration_extract_from_resnet_with_filter_func * test_meta_schedule_integration_extract_from_resnet

Create a specific test dependency that maps to USE_LIBTORCH, which is disabled by default and is independent of torch being installed on the underlying machine. Without it, the tests cause problems on machines that have torch installed but where TVM is built with USE_LIBTORCH OFF. Mark tests.python.contrib.test_libtorch_ops.test_backend with this new decorator.

* [TIR] Moved tir.FlattenBuffer to occur before tir.LowerOpaqueBlock

  For buffers with more than one physical axis, the `axis_separators` are required in order to know which groups of logical axes to fuse into each physical axis. The implementation in `tir.FlattenBuffer` assumed that all buffers were being flattened to a single physical axis. Because `tir.LowerOpaqueBlock` replaces the `BlockNode::alloc_buffers` with `Allocate` nodes, `tir.FlattenBuffer` no longer has access to the axis separators and performs inconsistent flattening for `Allocate` as opposed to `BufferLoad`/`BufferStore`. This was introduced in apache#12172, which decoupled the lowering/flattening steps.

  The commit reorders `tir.FlattenBuffer` to occur before `tir.LowerOpaqueBlock`, to make use of the axis separators. Any `Allocate` nodes that exist at that point (e.g. from hand-written schedules) are still flattened to 1-d physical buffers, but the `BlockNode::alloc_buffers` are flattened according to the axis separators.
* Add unit test to validate non-flat memory after tvm.lower
* Explicitly write T.reads for test on BufferRegion updates
* Update incorrect docstring for test
* Use DeclBuffer information in FlattenBuffer

  The DeclBuffer node can be inserted during LowerOpaqueBlock, and then provides the missing Buffer information required to flatten the allocation.
* Use T.allocate in unit tests

  With the insertion of `DeclBuffer` nodes, `LowerOpaqueBlock` no longer needs to be before `FlattenBuffer`, and has been moved back to its original position. Reverting the tests to use `T.allocate` instead of `T.alloc_buffer` more closely represents the functions as they are being lowered.
* Fix usage of T.decl_buffer in updated tests
* Update LowerOpaqueBuffer to expect the DeclBuffer nodes
* Strip DeclBuffer annotation in FlattenBuffer

  The DeclBuffer annotations aren't yet supported in all passes. This restricts them to being introduced in LowerOpaqueBuffer, then immediately removed in FlattenBuffer.
* Strip out all DeclBuffer nodes in FlattenBuffer
* Update unit tests to remove expectation of DeclBuffer nodes

Prior to this commit, `ReplaceBufferMutator` only checks `BufferRegionNode::buffer` to determine if a `BufferRegion` needs to be replaced, and doesn't check the `BufferRegionNode::region`. As a result, updating `T.reads(A[B[i]])` would fail to replace `B`. This commit checks `BufferRegionNode::region` for buffer usage to resolve this issue.

…apache#12678) * Move static array initialization into a function to avoid link errors * Fix line length

…e_inline (apache#12717) * [TIR, Schedule] Generate consumer-in-bound predicate after reverse_compute_inline * Check consumer block iters are covered * fix lint

The Zephyr project builds require 3.20.0 to work correctly Co-authored-by: driazati <driazati@users.noreply.github.com>

…he#12738) [CI] Update Docker images to tag 20220908-060034-62bdc91b1 Updates all Docker images to tag 20220908-060034-62bdc91b1, to update TensorFlow/TFLite/Keras to 2.9, and cascaded dependencies such as numpy. Updates ethos-u-vela to 3.4.0. It also brings ONNX and PyTorch to ci_arm, to enable Integration tests to be run in CI. Standardises the minimum CMake version required in CI to be 3.18.4, fixing apps/microtvm/zephyr_cmsisnn to require this version. Finally, adds a new import error in the tutorials documentation which doesn't affect the final result. The new warning added is 'absl:Found untraced functions such as _jit_compiled_convolution_op'

Aligned CMSIS-NN SHA in TVM to top of tree of CMSIS. -Aligned buffer size APIs to CMSIS implementations. -Updated the tests to match new CMSIS context buffer sizes. -This change needs updates to cortex-m docker image. Change-Id: I13f1ad29fe0ef02f08660eca4c818b5d66145ffc

…gs (apache#12741) * add nucleo overlay

Base IRBuilder methods for `PrimFunc` This PR introduces base IRBuilder methods for `PrimFunc`. Co-authored-by: yongwww <yongcale@gmail.com>

Previously, it was ambiguous whether `BlockNode::iter_vars` were in-scope for `BlockRealizeNode::predicate`. `ConvertBlocksToOpaque` treated them as in-scope, and applied a mapping from `iter_vars` to `iter_values`. Similarly, TVMScript printing places `T.where` statements below the `T.axis` statements, where `T.axis` definitions are in scope. However, `BlockRealizeNode::SEqualReduce` and `BlockRealizeNode::SHashReduce` do not visit the block and `iter_vars` until after visiting the predicate, placing the `iter_vars` out of scope. This commit updates the printing of `T.where` to be above `T.axis`, and updates `ConvertBlocksToOpaque` to report an error if the predicate contains references to `BlockNode::iter_vars`. After this commit, these three usages all consistently treat `BlockNode::iter_vars` as out of scope for `BlockRealizeNode::predicate`.

* Add opencl target in test build script * Fix fp16 test and compile test for opencl * fix lint * Fix relay OpenCL texture tests * Fix lint * Enable relay OpenCL tests * Fix opencl relay texture tests * fix lint * Remove OpenCL gtest variable * Fix unbound variable * Skip tests that are not supported in CI * fix lint * Add path for opencl gtest directory * Fix opencl gtests include directory * Enable OpenCL googletest. Fix bug in opencl timer test * testing fix for build cpp tests * update googletest git version for opencl tests build * update cmakelist * Update CMakeList * Update CMakeList * Disable opencl googletests * update OpenCL.cmake * fix OpenCL.cmake * Apply comments. Remove xfail decorator for opencl tests. Now specific tests are skipped in the environment script * minor code changes * apply comments * apply comment * skip test in ci by decorator * fix pytest skipif warnings * Fix skipif for opencl gtests

…apache#12658) [Frontend][Paddle] Fix adaptive_avg_pool2d in paddle, which didn't transmit layout information

apache#12515) * Add more strict check in tir imm construction and folding. * fix bool-compare compile error * fix some illegal imm construction in testcases * do not test i64 overflow behaviour because it is not consistent on cython and ctypes * fix float32 testcase * auto-inferred dtype should be int64 when value exceeds int32 range * add floatimm range check for fp16 and fp32 * add more folding testcases and fix store fp32 folding result to double * fix i386 fp16 cases

* [TOPI][Hexagon] Add test and schedule for uint8 resize2d * Fix correctness issue * Reformat * Remove cubic from testing * Remove unnecessary else

…2606) * [TOPI][Hexagon] Add test and schedule for uint8 resize2d * Fix correctness issue * Reformat * [TOPI][Hexagon] Implement quantized elementwise * Reformat * Address review comments * Reformat * Revert * Address review comments

Updates the driver stack used by the NPU to the latest released version (semantic version 3.1.0), while maintaining backwards compatibility for the previous version 22.05 (semantic 3.0.1) during the migration period. In addition, support for split is re-introduced as this is now supported in 22.08. Change-Id: I86bce3469f0b8ad52e66461ae055dec6717b3527

This PR introduces base IRBuilder methods for `Block`. Co-authored-by: yongwww <yongcale@gmail.com>

…12704) fix typo of compare between GlobalVar and str

This PR changes all ci_ to install TVM Python dependencies in a virtualenv separate from the system Python dependencies. Sets the stage for adding the poetry-based dependency generator to the CI container build process. * Always install into a python venv in ci containers. * Respect Dockerfile ENV PATH modifications in docker/bash.sh lookups.

* [Hexagon] Add Hand written HVX conv2d Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com> * Address review comments Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com> * Add some more comments and a file rename * Add gtest unit tests for blockize/deblockize * Add gtest unit tests fp16 utils Co-authored-by: Krzysztof Parzyszek <kparzysz@quicinc.com>

Support GREATER quantization operation conversion as part of issue apache#9187 Continuation of apache#11519.

…che#12662) Previously, the test cases only tested TE-based schedules. This commit runs the same tests for equivalent TIR-based schedules as well. This is intended to catch Hexagon-specific regressions, such as the one resolved in apache#12652.

This PR introduces a couple of fixes to make AutoTVM work more robustly: - Fixed a very rare case where `None` could pop up in AutoTVM features; - Fixed a misuse of `ARGS` in the testing script; - Fixed the filename for caching.

This PR migrates the usage of `check_trace` to `check_sketch`, which prefers structural equality of TIRs instead of string equality of traces.

…12764) * Migrate AutoBind * Migrate RandomComputeLocation * Migrate CrossThreadReduction * Migrate ParallelVectorizeUnroll

…on. (apache#12667) * [Hexagon] Increase max buffer size for tvm_rpc_android to 1GB. * [Hexagon] Make errors more clear when unable to allocate VTCM buffers and throw an error to fail early. * [Hexagon] Add mem_copy_DLTensor to enable directly calling DMA for mem copies. * [Hexagon] Add new tests as examples of the performance to expect when copying data to VTCM. * [Hexagon] Reduce rpc max size. * [Hexagon] Fix test_parallel_hvx_load_vtcm.py test output to be human readable. * Comment out tests that only work on 8Gen1 HDKs to get CI to pass

* support fp32 constants in quantized bias add * add a test * clean up comment * assert the bias is floating point as well as constant before requantizing

fixed apache#9955, this is covered by the existing test case `tests/python/relay/test_op_level3.py::test_unique`

) This is a pass refactored out of the AOTExecutorCodegen. Instead of combining all of the functionality of the AOTExecutorCodegen into a single monolithic pass, this pass only handles the lowering of the Relay main function into TIR. Tests for the pass are included.

Added operators pooling (avg, max), binary operators (add, subtract, multiply, min, max) and concat. Clip operator with min=0 and max=6 is remapped to relu6 to take advantage of CLML acceleration without sub graphing this to fallback path. Added new test cases for above listed operators and also end-to-end network test cases for Resnet50 & InceptionV3. CLML support FP16 arithmetic mode which gives significant performance boost over FP32. This PR enhances FP16 usage based on Operator datatype in relay graph. Co-authored-by: Krishna Raju quic_kvegiraj@quicinc.com Co-authored-by: Shwetank Singh quic_shwesing@quicinc.com

…che#10516) * [Relay][TE] Use Relay parameter name to generated TE tensor name Previously, the TE placeholders representing relay function parameters were all named `"placeholder"`, which could be difficult to follow when debugging larger functions.

…pache#12456) The dependencies for these have moved into ci_cortexm Docker image, so there is not much point in building them for ci_cpu as we can't run the associated tests.

This PR introduces remaining IRBuilder methods for `PrimFunc`. Co-authored-by: yongwww <yongcale@gmail.com>

[TIR][MetaSchedule] Support Tuple Reduction This PR improves our TIR scheduling primitives/transformations (rfactor & cross-thread reduction) designed for reduction operators, so that they can be applied to blocks of tuple-reduction.

) Following change introduced installing python dependencies inside virtual environments: apache#12663 Previous to this fix, a different version of python was being picked up that didn't catch the issues fixed in this commit. Change-Id: Ie290d9474a799311e07d293fa1b8299326b11661

…pache#12756) * [microTVM][Zephyr] Fix PLL freq. in overlay for nucleo_l4r5zi board Commit 1d32c40 ("Add project overlay to overwrite device tree configs") added an overlay for setting the 'clock-frequency' property of node 'rcc' to 120 MHz. However, to effectively change the PLL frequency that drives the core, it is also necessary to overlay the attributes for the 'pll' node. This commit does that. Signed-off-by: Gustavo Romero <gustavo.romero@linaro.org> * Remove div-p and div-q properties from overlay Remove div-p and div-q properties from the overlay file since values for these properties will be inherited from the 'pll' that is overlaid. Since microTVM currently does not use any subsystem that relies on clocks associated with either the P or Q params, these params can be left unchanged for now. Signed-off-by: Gustavo Romero <gustavo.romero@linaro.org>

…#12784) Prior to this commit, the templated `TryConstFold` utility returned an undefined `PrimExpr` to represent a failure to perform constant folding. This commit makes this explicit by returning `Optional<PrimExpr>` instead.

* [TIR, Schedule] Add schedule primitive PadEinsum Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com> * lint * [TIR] Fix producer indices check in PadEinsum * address comments * simplify lambda expr * fix Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>

[Arith] Simplify nested if_then_else Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>

…mage to enable RISC-V unit testing (apache#12534) * Remove CSI-NN from ci_cortexm docker image * [Docker] [RISC-V] Split up CSI-NN2 installation script into several files [Docker] [RISC-V] move gcc toolchain installation out of csi-nn2 script [Docker] [RISC-V] move qemu installation out of csi-nn2 script * use updated version of qemu * [Docker] [RISC-V] Install newlib (baremetal) gcc toolchain * [Docker] [RISC-V] Install spike simulator * [Docker] move initialization of timezone and DEBIAN_FRONTEND to ubuntu_install_core.sh script

…odule (apache#12747) Hopefully fixes apache#12742. The warning should only be printed when a user passes `target_host`. Currently, if the user passes `None` as `target_host`, it is processed by `canon_target_map_and_host`, which seems to always produce a `target_host` and thus triggers the warning despite the user doing nothing wrong.

This should mitigate failures like in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/main/4274/pipeline. This also moves the `retry` function to a script now that we have PR apache#12604. Co-authored-by: driazati <driazati@users.noreply.github.com>

This should fix some version drift in the current cmake versions in the Docker containers (currently running all of 3.10, 3.16, 3.18, and 3.20) Co-authored-by: driazati <driazati@users.noreply.github.com>

This has been working fine for a while; this change opens it up so it's not limited to the authors in apache#9983. Co-authored-by: driazati <driazati@users.noreply.github.com>

This PR introduces remaining IRBuilder methods for `For`. Co-authored-by: yongwww <yongcale@gmail.com>

This change tries to fix an issue due to apache#12515. Previously the logic for `-2147483648` was `parse(-literal)` = `-parse(literal)`, and all integer literals were converted to i32 (whether or not the literal value actually overflowed). Since apache#12515, parsing `2147483648` yields an i64-typed integer rather than i32, so `-2147483648` becomes an i64 integer too, which is not reasonable.
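The literal problem can be reproduced with a toy parser. This is a sketch under stated assumptions: `dtype_for`, `parse_naive` and `parse_signed` are hypothetical helpers, and `parse_signed` shows one possible way to avoid the issue, not necessarily what the PR does.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def dtype_for(value: int) -> str:
    """Pick the narrowest of int32/int64 that can hold the value."""
    return "int32" if INT32_MIN <= value <= INT32_MAX else "int64"


def parse_naive(text: str):
    """parse(-literal) = -parse(literal): the dtype is chosen from the magnitude."""
    negative = text.startswith("-")
    magnitude = int(text[1:] if negative else text)
    dtype = dtype_for(magnitude)  # 2147483648 does not fit in int32
    return (-magnitude if negative else magnitude), dtype


def parse_signed(text: str):
    """Choose the dtype from the signed value, so INT32_MIN stays int32."""
    value = int(text)
    return value, dtype_for(value)
```

Here `parse_naive("-2147483648")` reports `int64` because the magnitude alone overflows i32, whereas `parse_signed("-2147483648")` reports `int32`.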

A couple of these names linked to 404 pages; this PR updates them to their current counterparts. Co-authored-by: driazati <driazati@users.noreply.github.com>

Added extra simplify step to eliminate false negative cases.

…2796) This PR introduces a clone function for each of the task-level MetaSchedule classes for convenient class deep copying.
- [x] ScheduleRule
- [x] Postproc
- [x] Mutator
- [x] SpaceGenerator
- [x] SearchStrategy
- [x] TuneContext

This PR finishes migration from `check_trace` (string-based equality check on TIR trace) to `check_sketch` (SEqual-based equality check on TIR). Here, we split multi-level-tiling into 3 files:
- Plain multi-level tiling without any intrinsics
- Multi-level tiling with intrinsics like VNNI, DP4a
- Multi-level tiling with TensorCore which comes with different handling

Besides, we cleaned up the testing folder and removed several methods that are no longer useful for unittests.

This PR introduces remaining IRBuilder methods for `Axis`. Co-authored-by: yongwww <yongcale@gmail.com>

These were broken due to this missing guard: https://ci.tlcpack.ai/job/docker-images-ci/job/docker-image-run-tests/223/console Co-authored-by: driazati <driazati@users.noreply.github.com>

…ion (apache#12811) Fix random state fork in TuneContext Clone function.

Virtual environments were recently introduced in the docker images, which was a great contribution to localizing errors: apache#12663. In this fix, the link to caffe is created inside this virtual env instead of being added to the system path of python. This fix also removes imports of the request package where they are not needed. Fixes apache#12663

apache#12783) [Hexagon] Reduce the number of tests run for VTCM testing in order to speed up CI.

…apache#12807) * Protect access to global buffer manager map * Fix lint

This was missing a repo checkout and failing as in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/main/4302/pipeline. This also adds in the changes from apache#12719: Fixes apache#12600. The original solution there doesn't actually fix the issue; that would need some job queue that could make sure to reject old pushes. Since this case is pretty rare, generally the next commit that comes along and builds will fix everything up, so we can ignore failures that happen on `push`es.

This would post the comment that the tests bot and the docs comment bot use straightaway when a PR is posted. This will contain links to generic info about posting PRs (and obviate the `.github/PULL_REQUEST_TEMPLATE.md`) as well as dynamic info about the specific PR (filled in later by the respective bots). This would make things like the auto-cc bot more transparent since it would have a link to the relevant issue. Tested live here: driazati#21 (comment)

…ache#12778) * [Testing] Add decorator tvm.testing.requires_cuda_compute_version Previously, individual unit tests would call `tvm.contrib.nvcc.get_target_compute_version` and return early. This was repeated boilerplate in many tests, and incorrectly reported a test as `PASSED` if the required infrastructure wasn't present. This commit introduces `tvm.testing.requires_cuda_compute_version`, a decorator that checks the CUDA compute version and applies `pytest.mark.skipif`. If required infrastructure isn't present, a test will be reported as `SKIPPED`. * requires_cuda_compute_version skips test when no GPU is present
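The skip-instead-of-early-return pattern can be sketched with the standard library. Note the swap: the commit applies `pytest.mark.skipif`, while this sketch uses `unittest.skipIf` so that it has no third-party dependency. `detect_compute_version` and `requires_compute_version` are hypothetical names, and the probe is a stub that pretends no GPU is present.

```python
import unittest


def detect_compute_version():
    """Hypothetical probe: returns (major, minor), or None when no GPU is present."""
    return None


def requires_compute_version(major, minor=0, probe=detect_compute_version):
    """Return a decorator that skips the test unless the version is high enough."""
    found = probe()
    satisfied = found is not None and found >= (major, minor)
    reason = f"requires compute version >= {major}.{minor}, found {found}"
    # skipIf marks the test as skipped rather than letting it pass silently.
    return unittest.skipIf(not satisfied, reason)


@requires_compute_version(8)
def test_needs_sm80():
    return "ran"
```

Calling `test_needs_sm80()` raises `unittest.SkipTest`, so a runner reports the test as skipped instead of passed when the required hardware is missing.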

* add debug option to hexagon pytest * address comment

* First pass at improving runtime resource management * Add unit test * Fix lint and clang format errors * Disable resource reset for simulator * Moved acquire/release calls to session object, separate buffer managers for non-runtime (static) and runtime (dynamic). * Fix lint errors * Fix lint errors * Improve robustness of session shutdown * Fix lint * Address feedback * Only allow call to Acquire in a clean state * Use a pointer to indicate the "active" manager

This PR introduces remaining IRBuilder methods for `Block`. Co-authored-by: yongwww <yongcale@gmail.com>

…e#12827) This PR adds two reducers to the TIR reduction part, so that rfactor and cross-thread reduction can be applied to functions that contain argmax/argmin computations generated by TOPI.
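Why argmax needs its own reducer before it can be factored is easier to see on (index, value) pairs. The sketch below is plain Python, not TIR; the tie-breaking rule (smaller index wins) and the names `argmax_combine`, `argmax` and `argmax_rfactor` are assumptions made for illustration.

```python
from functools import reduce

IDENTITY = (-1, float("-inf"))  # neutral element for the combiner below


def argmax_combine(lhs, rhs):
    """Keep the pair with the larger value; on ties keep the smaller index."""
    (li, lv), (ri, rv) = lhs, rhs
    if lv > rv or (lv == rv and li < ri):
        return lhs
    return rhs


def argmax(values):
    """Sequential reduction over (index, value) pairs."""
    return reduce(argmax_combine, enumerate(values), IDENTITY)


def argmax_rfactor(values, parts):
    """Reduce `parts` interleaved slices independently, then merge the partials."""
    partials = []
    for p in range(parts):
        chunk = [(i, v) for i, v in enumerate(values) if i % parts == p]
        partials.append(reduce(argmax_combine, chunk, IDENTITY))
    return reduce(argmax_combine, partials, IDENTITY)
```

Because the combiner is associative and commutative, splitting the reduction across any number of parts gives the same pair as the sequential version, which is the property that factoring or cross-thread reduction relies on.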

Computing the inverse mapping requires arithmetic analysis which is not guaranteed to cover all cases. We provide the pre-defined inverse index map instead.

Prior to this PR, the LCA detector of buffers in TIR didn't take buffer memory scopes and GPU hierarchy into consideration. A consequent issue is that, when an intermediate buffer is in global memory, TIR's lowering passes don't necessarily allocate the intermediate buffer outside all `blockIdx`. As a result, the global intermediate buffer is allocated under a GPU thread block, which is illegal. This PR fixes this issue by making the LCA detector aware of the buffer memory scopes and GPU hierarchy. With this fix, the global intermediate buffers are all allocated outside `blockIdx`.

) This PR is split from apache#12492, to make the necessary updates to the printer infra for future PRs of TIR printer. Tracking issue: apache#11912 Co-authored-by: Greg Bonik <gbonik@octoml.ai>

…e#12825) This PR relaxes the conditions of the Meta-Schedule schedule rule CrossThreadReduction. The rules were previously somewhat over-strict, so some workloads with a small reduction loop length could not be optimized by cross-thread reduction automatically. In this PR, we relax the rules so that such workloads can be optimized.

This PR introduces IRBuilder methods for `Assert`, `Let`, `Realize`, `Evaluate`, `LaunchThread`, `EnvThread`. Co-authored-by: yongwww <yongcale@gmail.com>

This PR introduces IRBuilder methods for `allocate`, `Let`, `allocate_const`, `attr`, `While`, `If/Then/Else`, `decl_buffer`, `buffer_store`, `prefetch`. Co-authored-by: yongwww <yongcale@gmail.com>

…trs["force_suppress"] (apache#12593) * [Frontend][TFLite]fix detection_postprocess's non_max_suppression_attrs["force_suppress"] TVM only supports the detection_postprocess operator with use_regular_nms set to false. In that mode, TFLite's NMS suppresses boxes that exceed the threshold regardless of class, so force_suppress must be set to True for the TVM and TFLite results to be consistent. * [Frontend][TFLite]fix detection_postprocess's non_max_suppression_attrs[force_suppress] Added a test case that reproduces the inconsistent results between TVM and TFLite when force_suppress is false; setting force_suppress to true gives the correct result.
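The effect of the flag can be illustrated with a small greedy NMS in plain Python. This is a simplified sketch, not the TVM or TFLite implementation; the detection layout `(class_id, score, box)` and the helper names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold, force_suppress):
    """Greedy NMS over (class_id, score, box) tuples, highest score first.

    With force_suppress=True an overlapping box is dropped regardless of its
    class; with False only boxes of the same class suppress each other.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        cls, _, box = det
        suppressed = any(
            (force_suppress or cls == kept_cls) and iou(box, kept_box) > iou_threshold
            for kept_cls, _, kept_box in kept
        )
        if not suppressed:
            kept.append(det)
    return kept
```

For two heavily overlapping boxes of different classes, `force_suppress=True` keeps only the higher-scoring one, while `force_suppress=False` keeps both, which is the kind of mismatch the commit describes.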

Implementation of API in `tvm.tir.schedule` for layout transformations with padding, as part of apache#12261, item "Insert pad value into generated TIR, using `tir::if_then_else`, `builtin::assume`, and `builtin::undef`". Following the RFC discussion in apache/tvm-rfcs#77 (comment) and apache/tvm-rfcs#77 (comment), this commit preferentially rewrites the loops that surround a padded transformation where possible, in order to express padding in terms of `tir::if_then_else`.
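The idea of expressing padding through a conditional in the surrounding loops can be sketched without any TVM API. In the plain-Python analogue below, the conditional expression plays the role that `tir::if_then_else` plays in the generated TIR; `transform_with_padding` and its parameters are hypothetical.

```python
def transform_with_padding(data, inner, pad_value=0):
    """Reshape a 1-D buffer to (outer, inner), padding the tail of the last row.

    Each output element is written by the loops that surround the transformed
    layout; indices past the original extent receive pad_value.
    """
    n = len(data)
    outer = (n + inner - 1) // inner  # ceil(n / inner)
    result = []
    for io in range(outer):
        row = []
        for ii in range(inner):
            flat = io * inner + ii
            # Conditional write: real data inside the extent, padding outside.
            row.append(data[flat] if flat < n else pad_value)
        result.append(row)
    return result
```

For a buffer of 14 elements and `inner=4`, the result has four rows and the last row ends with two padded entries.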

Commits on Sep 21, 2022

Move to the metaschedule folder

yelite committed Sep 21, 2022

23a6658

Commits on Sep 23, 2022

Add rpc config and some documentation

yelite committed Sep 23, 2022

03d630f
